Commit 778bb56

Merge release v0.1.16
Release v0.1.16
2 parents 2d15105 + 40776cc

File tree

23 files changed: +475 −555 lines

.gitmodules

Lines changed: 0 additions & 3 deletions
This file was deleted.

.markdownlint.jsonc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@

```jsonc
{
    // Default state for all rules
    "default": true,

    // MD007/ul-indent - Unordered list indentation
    "MD007": {
        // Spaces for indent
        "indent": 4
    },

    // MD013/line-length - Line length
    "MD013": {
        // Number of characters
        "line_length": 900,
        // Number of characters for headings
        "heading_line_length": 80,
        // Number of characters for code blocks
        "code_block_line_length": 500 // some example console output is wide
    },

    // MD046/code-block-style - Code block style
    // Disable consistency checks between fenced/indented code blocks.
    // Standard code blocks should use fences, while mkdocs admonitions require
    // 4-space indented blocks.
    "code-block-style": false
}
```

Makefile

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

```makefile
# Runs a container that lints all markdown (.md) files in the project.
# Uses markdownlint-cli (https://github.com/igorshubovych/markdownlint-cli).
# Rules for the markdownlint package can be found here:
# https://github.com/DavidAnson/markdownlint/blob/main/doc/Rules.md
markdownlint:
	docker run --rm --name markdownlint \
		--volume ${PWD}:/workdir \
		ghcr.io/igorshubovych/markdownlint-cli:latest \
		--config .markdownlint.jsonc --ignore venv "**/*.md"
```

README.md

Lines changed: 12 additions & 2 deletions
````diff
@@ -18,15 +18,25 @@ $ . venv/bin/activate
 (venv) $ pip install -r mkdocs_requirements.txt
 ```
 
-### Run mkdocs Server
+### Run mkdocs or mike Server
 
 To run mkdocs server locally, execute `mkdocs serve`. The output will appear similar to below, with the localhost URL listed at the end.
 
 ```bash
-(venv) $ mkdocs serve
+(venv) $ venv/bin/mkdocs serve
 INFO - Building documentation...
 [...]
 INFO - Documentation built in 0.22 seconds
 INFO - [10:59:28] Watching paths for changes: 'docs', 'mkdocs.yml'
 INFO - [10:59:28] Serving on http://127.0.0.1:8000/
 ```
+
+Or run `mike serve`.
+
+```bash
+(venv) $ venv/bin/mike serve
+Starting server at http://localhost:8000/
+Press Ctrl+C to quit.
+^CStopping server...
+```
````

docs/guides/data-movement/copy-offload-api.html

Lines changed: 0 additions & 1 deletion
This file was deleted.
docs/guides/data-movement/copy-offload.md

Lines changed: 105 additions & 0 deletions

@@ -0,0 +1,105 @@

# Copy-Offload

The copy-offload API allows a user's compute application to specify [Data Movement](../data-movement/readme.md) requests. The user's application uses the `libcopyoffload` library to establish a secure connection to the copy-offload server to initiate, list, query the status of, or cancel data movement requests. The copy-offload server accepts only those requests that present its Workflow's token.

The copy-offload server is implemented as a special kind of [User Container](../user-containers/readme.md). Like all user containers, it is activated by a `DW container` directive in the user's job script and runs on the Rabbit nodes that are associated with the compute nodes in the user's job.

## Administrative Configuration

### TLS signing key and certificate

A signing key and self-signed TLS certificate must be created and made available to the copy-offload server, and the certificate must also be copied to each compute node. This certificate must have a SAN extension that describes all of the Rabbit nodes.
Tools are available to assist in creating this certificate and its signing key. Begin by confirming that the cluster's `SystemConfiguration` resource can be accessed using the `kubectl` command. This resource contains the information about all of the Rabbit nodes and is used when creating the SAN extension for the certificate:

```console
kubectl get systemconfiguration
```

Run `tools/mk-usercontainer-secrets.sh` from either the `nnf-deploy` workarea or from a gitops repo derived from the [argocd boilerplate](https://github.com/NearNodeFlash/argocd-boilerplate).

```console
tools/mk-usercontainer-secrets.sh
```

That tool creates the signing key and the certificate and stores them in a Kubernetes secret named `nnf-dm-usercontainer-server-tls`. This first secret is mounted into the copy-offload server's pod when it is specified in a user's job script. The certificate is also stored by itself in a Kubernetes secret named `nnf-dm-usercontainer-client-tls`. The content of this second secret can be retrieved by the administrator and copied to each compute node.

```console
CLIENT_TLS_SECRET=nnf-dm-usercontainer-client-tls
kubectl get secrets $CLIENT_TLS_SECRET -o json | jq -rM '.data."tls.crt"' | base64 -d > cert.pem
```

!!! info

    Copy the certificate to `/etc/nnf-dm-usercontainer/cert.pem` on each compute node. It must be readable by all users' compute applications.
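The SAN extension on the retrieved certificate can be checked with `openssl`. The sketch below is self-contained: it first generates a throwaway self-signed certificate naming two hypothetical Rabbit nodes (`rabbit-node-1` and `rabbit-node-2` are invented stand-ins for the real node names), then inspects the SAN entries. Against a real deployment, only the final command is needed, run against the `cert.pem` retrieved above.

```shell
# Generate a stand-in self-signed certificate; the SAN entries are
# hypothetical Rabbit node names, not from a real SystemConfiguration.
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
    -days 1 -subj "/CN=nnf-dm-usercontainer" \
    -addext "subjectAltName=DNS:rabbit-node-1,DNS:rabbit-node-2"

# Inspect the SAN extension; every Rabbit node should be listed.
openssl x509 -in cert.pem -noout -text | grep -A1 "Subject Alternative Name"
```

The `-addext` option requires OpenSSL 1.1.1 or later.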
### Library libcopyoffload

The [`libcopyoffload` library](https://github.com/NearNodeFlash/nnf-dm/tree/master/daemons/lib-copy-offload) must be made available on the compute nodes and in the developer environments for users to use with their applications.

### WLM and the per-Workflow token

!!! note

    The following must be handled by the WLM service. There is nothing here for the administrator to do.

The WLM, such as Flux, must retrieve the per-Workflow token and make it available to the user's compute application as an environment variable named `DW_WORKFLOW_TOKEN`. The token is used by the `libcopyoffload` library to construct the bearer token for its requests to the copy-offload server. The token becomes invalid after the Workflow enters its teardown state.
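As a rough illustration of what `libcopyoffload` does with this variable, the sketch below builds an HTTP `Authorization` header from a placeholder token. The token value is invented for illustration, and the header shown is the generic bearer-token form, not a statement about the library's internal implementation.

```shell
# Placeholder token for illustration only; the real value is set by the WLM.
DW_WORKFLOW_TOKEN="example-token"

# A bearer-token header of the general form used with HTTPS APIs.
printf 'Authorization: Bearer %s\n' "$DW_WORKFLOW_TOKEN"
```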
The Workflow contains a reference to the name of the Secret that holds the token. The following command returns the name and namespace of the secret:

```console
kubectl get workflow $WORKFLOW_NAME -o json | jq -rM '.status.workflowToken'
```

If information about the token's secret is returned, then read the token from the given secret:

```console
TOKEN=$(kubectl get secret -n $SECRET_NAMESPACE $SECRET_NAME -o json | jq -rM '.data.token' | base64 -d)
```

Create the environment variable for the user's compute application:

```bash
DW_WORKFLOW_TOKEN="$TOKEN"
```
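The decode step above can be exercised without a cluster. In this self-contained sketch, a locally base64-encoded placeholder stands in for the `.data.token` value that the `kubectl` and `jq` pipeline would extract from the secret:

```shell
# Stand-in for the base64-encoded token stored in the secret's .data.token;
# "example-token" is an invented value, not a real credential.
ENCODED=$(printf 'example-token' | base64)

# Same decode as the kubectl pipeline above.
TOKEN=$(printf '%s' "$ENCODED" | base64 -d)
DW_WORKFLOW_TOKEN="$TOKEN"
echo "DW_WORKFLOW_TOKEN=$DW_WORKFLOW_TOKEN"
```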
!!! note

    Per-Workflow tokens are not limited to the copy-offload API. Any user container may request to be configured with the job's per-Workflow token and the TLS certificate. See `requires=user-container-auth` in [User Containers](../user-containers/readme.md). The WLM must always check for the existence of a token secret in the Workflow.

## User Enablement of Copy Offload

Users enable the copy-offload server by requesting it in their job script. The script must contain a `#DW container` directive that specifies the desired copy-offload container profile. At least one of the `#DW jobdw` or `#DW persistentdw` directives in the job script must include the `requires=copy-offload` statement. See [User Interactions](../user-interactions/readme.md) for more details about these directives.

The user's compute application must be linked with the `libcopyoffload` library. This library knows how to find and use the TLS certificate and the per-Workflow token required for communication with the copy-offload server for the user's job.

The copy-offload container profile is specified in the `container` directive. See [User Containers](../user-containers/readme.md) for details about using container profiles. The following directives show that the job uses copy-offload and select the default copy-offload container profile:

```bash
#DW jobdw name=my-job-name requires=copy-offload [...]
#DW container name=copyoff-container profile=copy-offload-default [...]
```

!!! info

    See [User Containers](../user-containers/readme.md) for details about customizing the directives and the container profile for the storage resources created by the Workflow.

### Use libcopyoffload

The [`libcopyoffload` library](https://github.com/NearNodeFlash/nnf-dm/tree/master/daemons/lib-copy-offload) must be linked into the user's compute application. See its header file and associated test tool for a description and example usage of the API.

## Certificate and Per-Workflow Token Details

The per-Workflow token and its signing key are created during the Workflow's `Setup` state, and they are destroyed when the Workflow enters its `Teardown` state.

The WLM places the per-Workflow token in an environment variable named `DW_WORKFLOW_TOKEN` for the application on the compute node. The application on the compute node can find the TLS certificate in `/etc/nnf-dm-usercontainer/cert.pem`. The `libcopyoffload` library uses the per-Workflow token and the TLS certificate to communicate securely with the copy-offload server.

The TLS certificate, its signing key, and the token's signing key are mounted into the copy-offload server's Pod when it is created during the Workflow's `PreRun` state. The Pod contains the following environment variables, which can be used to access the certificate and the signing keys:

| Environment Variable | Value |
|----------------------|-------|
| TLS_CERT_PATH | The pathname to the TLS certificate. |
| TLS_KEY_PATH | The pathname to the signing key for the TLS certificate. |
| TOKEN_KEY_PATH | The pathname to the signing key for the per-Workflow token. |
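A startup script inside the copy-offload server's Pod might consume the variables from the table above as sketched below. The fallback paths are invented so the sketch runs outside a real Pod; they are not defaults of the actual container.

```shell
# Read the mounted paths from the environment; the fallback values here are
# hypothetical, used only so this sketch runs outside a real Pod.
TLS_CERT_PATH="${TLS_CERT_PATH:-/secrets/tls.crt}"
TLS_KEY_PATH="${TLS_KEY_PATH:-/secrets/tls.key}"
TOKEN_KEY_PATH="${TOKEN_KEY_PATH:-/secrets/token.key}"

echo "cert=$TLS_CERT_PATH key=$TLS_KEY_PATH token-key=$TOKEN_KEY_PATH"
```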
These pieces are not restricted to the copy-offload API. They can be used by any user container. See `requires=user-container-auth` in [User Containers](../user-containers/readme.md), and [Environment Variables](../user-interactions/readme.md#environment-variables), for details.

docs/guides/data-movement/readme.md

Lines changed: 9 additions & 20 deletions
```diff
@@ -8,10 +8,10 @@ categories: provisioning
 Data Movement can be configured in multiple ways:
 
 1. Server side (`NnfDataMovementProfile`)
-2. Per Copy Offload API Request arguments
+2. Copy offload API server
 
 The first method is a "global" configuration - it affects all data movement operations that use a
-particular `NnfDataMovementProfile` (or the default). The second is done per the Copy Offload API,
+particular `NnfDataMovementProfile` (or the default). The second is done per the `copy offload` API,
 which allows for some configuration on a per-case basis, but is limited in scope. Both methods are
 meant to work in tandem.
 
@@ -24,26 +24,17 @@ for understanding how to use profiles, set a default, etc.
 For an in-depth understanding of the capabilities offered by Data Movement profiles, we recommend
 referring to the following resources:
 
-- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha6/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
-- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha6_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
+- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha7/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
+- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha7_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
 - [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
 
-## Copy Offload API Daemon
+## Copy Offload API Server
 
-The `CreateRequest` API call that is used to create Data Movement with the Copy Offload API has some
-options to allow a user to specify some options for that particular Data Movement operation. These
-settings are on a per-request basis. These supplement the configuration in the
-`NnfDataMovementProfile`.
+The `copy offload` API allows the user's compute application to specify options for particular Data Movement operations. These settings are on a per-request basis and supplement the configuration in the `NnfDataMovementProfile`.
 
-The Copy Offload API requires the `nnf-dm` daemon to be running on the compute node. This daemon may
-be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run
-it only when a user requests it. See [Compute Daemons](../compute-daemons/readme.md) for the systemd
-service configuration of the daemon. See `Requires` in [Directive
-Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the
-daemon in the case where the WLM will run it only on demand.
+The copy offload API requires the `copy-offload` server to be running on the Rabbit node. This server is implemented as a [User Container](../user-containers/readme.md) and is activated by the user's job script. The user's compute application must be linked with the `libcopyoffload` library.
 
-See the [DataMovementCreateRequest API](copy-offload-api.html#datamovement.DataMovementCreateRequest)
-definition for what can be configured.
+See [Copy Offload](../data-movement/copy-offload.md) for details about the usage and lifecycle of the copy offload API server.
 
 ## SELinux and Data Movement
 
@@ -53,9 +44,7 @@ the compute node, which may not be supported by the destination file system (e.g
 
 Depending on the configuration of `dcp`, there may be an attempt to copy these xattrs. You may need
 to disable this by using `dcp --xattrs none` to avoid errors. For example, the `command` in the
-`NnfDataMovementProfile` or `dcpOptions` in the [DataMovementCreateRequest
-API](copy-offload-api.html#datamovement.DataMovementCreateRequest) could be used to set this
-option.
+`NnfDataMovementProfile` could be used to set this option.
 
 See the [`dcp` documentation](https://mpifileutils.readthedocs.io/en/latest/dcp.1.html) for more
 information.
```

docs/guides/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -12,7 +12,7 @@
 
 * [Storage Profiles](storage-profiles/readme.md)
 * [Data Movement Configuration](data-movement/readme.md)
-* [Copy Offload API](data-movement/copy-offload-api.html)
+* [Copy Offload](data-movement/copy-offload.md)
 * [Lustre External MGT](external-mgs/readme.md)
 * [Global Lustre](global-lustre/readme.md)
 * [Directive Breakdown](directive-breakdown/readme.md)
```

docs/guides/initial-setup/readme.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ Instructions for the initial setup of a Rabbit are included in this document.
 
 1. Disable UDEV for LVM
 2. Disable UDEV sync at the host operating system level
-3. Disable UDEV sync using the `noudevsync` command option for each LVM command
+3. Disable UDEV sync using the `--noudevsync` command option for each LVM command
 4. Clear the UDEV cookie using the `dmsetup udevcomplete_all` command after the lvcreate/lvremove command.
 
 Taking these in reverse order, using option 4 allows UDEV settings within the host OS to remain unchanged from the default. One would need to start the `dmsetup` command on a separate thread because the LVM create/remove command waits for the UDEV cookie. This opens too many error paths, so it was rejected.
```
