
Commit 2d15105

Merge release v0.1.15
Release v0.1.15
2 parents 85fab38 + 9a1aba7 commit 2d15105

14 files changed: +724 −41 lines changed

docs/guides/data-movement/readme.md

Lines changed: 88 additions & 11 deletions
@@ -3,9 +3,7 @@ authors: Blake Devcich <[email protected]>
categories: provisioning
---

-# Data Movement Overview
-
-## Configuration
+# Data Movement Configuration

Data Movement can be configured in multiple ways:

@@ -17,7 +15,7 @@ particular `NnfDataMovementProfile` (or the default). The second is done per the
which allows for some configuration on a per-case basis, but is limited in scope. Both methods are
meant to work in tandem.

-### Data Movement Profiles
+## Data Movement Profiles

The server side configuration is controlled by creating `NnfDataMovementProfiles` resources in
Kubernetes. These work similar to `NnfStorageProfiles`. See [here](../storage-profiles/readme.md)
@@ -26,11 +24,11 @@ for understanding how to use profiles, set a default, etc.
For an in-depth understanding of the capabilities offered by Data Movement profiles, we recommend
referring to the following resources:

-- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha1/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
-- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha1_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
-- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
+- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha6/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
+- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha6_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
+- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`

-### Copy Offload API Daemon
+## Copy Offload API Daemon

The `CreateRequest` API call that is used to create Data Movement with the Copy Offload API has some
options to allow a user to specify some options for that particular Data Movement operation. These
@@ -40,14 +38,14 @@ settings are on a per-request basis. These supplement the configuration in the
The Copy Offload API requires the `nnf-dm` daemon to be running on the compute node. This daemon may
be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run
it only when a user requests it. See [Compute Daemons](../compute-daemons/readme.md) for the systemd
-service configuration of the daemon. See `RequiredDaemons` in [Directive
+service configuration of the daemon. See `Requires` in [Directive
Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the
daemon in the case where the WLM will run it only on demand.

See the [DataMovementCreateRequest API](copy-offload-api.html#datamovement.DataMovementCreateRequest)
definition for what can be configured.

-### SELinux and Data Movement
+## SELinux and Data Movement

Careful consideration must be taken when enabling SELinux on compute nodes. Doing so will result in
SELinux Extended File Attributes (xattrs) being placed on files created by applications running on
@@ -62,7 +60,7 @@ option.
See the [`dcp` documentation](https://mpifileutils.readthedocs.io/en/latest/dcp.1.html) for more
information.

-### `sshd` Configuration for Data Movement Workers
+## `sshd` Configuration for Data Movement Workers

The `nnf-dm-worker-*` pods run `sshd` in order to listen for `mpirun` jobs to perform data movement.
The number of simultaneous connections is limited via the sshd configuration (i.e. `MaxStartups`).
@@ -72,3 +70,82 @@ start rejecting connections once the limit is reached.

The `sshd_config` is stored in the `nnf-dm-worker-config` `ConfigMap` so that it can be changed on
a running system without needing to roll new images. This also enables site-specific configuration.
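
For example, to raise the connection limit you can edit `MaxStartups` in that `ConfigMap` directly. A minimal sketch follows; the `nnf-dm-system` namespace and the example value are assumptions to verify against your own deployment, not values taken from this commit:

```console
# Namespace is an assumption; verify where the nnf-dm worker pods run on your system.
kubectl edit configmap -n nnf-dm-system nnf-dm-worker-config
```

Within the embedded `sshd_config`, `MaxStartups` uses sshd's `start:rate:full` form (for example, `MaxStartups 100:30:200`); choose values that match the number of simultaneous `mpirun` connections you expect.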

## Enabling Core Dumps

### Mounting Core Dump Volumes

First, you must determine how your nodes handle core dumps. For example, if `systemd-coredump` is
used, then core dumps inside containers are moved to the host node automatically. If that is not
the case, then a directory on the host nodes will need to be mounted into the Data Movement
containers. This directory will contain any core dumps collected by data movement operations, mainly
`mpirun` or `dcp`.

For Data Movement, the pods run on two types of Kubernetes nodes:

- `nnf-dm-worker` pods on Rabbit nodes
- `nnf-dm-controller` pods on Kubernetes worker nodes

For all of these nodes, a core dump directory needs to be present and consistent across the nodes.
Once it is in place, we can edit the Kubernetes configuration to mount this directory from the host
node into the containers using a [`hostPath`
Volume](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath).

This configuration is added via the gitops repository for the system. Patches are used to patch the
`nnf-dm` containers to mount the core dump directory via a `hostPath` volume.

An example of this configuration is provided in
[`argocd-boilerplate`](https://github.com/NearNodeFlash/argocd-boilerplate/tree/main/environments/example-env/nnf-dm).
There are two patch files that add Volumes to mount `/localdisk/dumps` from the host node at the
same location inside the containers.

- [`dm-controller-coredumps.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/dm-controller-coredumps.yaml)
- [`dm-manager-coredumps.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/dm-manager-coredumps.yaml)

[`kustomization.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/kustomization.yaml#L13C1-L24C29)
then applies these patches to the correct resources.
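
For reference, the general shape of such a patch is sketched below. This is illustrative only — the container name is a placeholder — so treat the linked boilerplate files as the authoritative versions:

```yaml
# Illustrative sketch: names below are placeholders; see the argocd-boilerplate
# patch files linked above for the real targets.
spec:
  template:
    spec:
      containers:
        - name: manager                 # placeholder container name
          volumeMounts:
            - name: coredumps
              mountPath: /localdisk/dumps
      volumes:
        - name: coredumps
          hostPath:
            path: /localdisk/dumps
            type: DirectoryOrCreate
```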

### Editing the Data Movement Command

Once the volume is in place, the Data Movement command must be updated to first `cd` into this
directory. This ensures that the core dump is placed in that directory, making it accessible on the
host node.

To achieve this, update the Data Movement profiles in your gitops repository to prepend
`cd /localdisk/dumps && ` to the `command`. For example, the default profile in
`environments/<system>/nnf-sos/default-nnfdatamovementprofile.yaml` would look like the following:

```yaml
kind: NnfDataMovementProfile
metadata:
  name: default
  namespace: nnf-system
data:
  command: ulimit -n 2048 && cd /localdisk/dumps && mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST
```

Note that core patterns for containers are inherited from the host and that Linux containers do not
support a container-only core pattern without also affecting the host node. This is why we must use
a preceding `cd <dir>` in the Data Movement command.

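To see where a dump will land, you can inspect the kernel's core pattern; because the setting is shared with the host, the same value is reported on the host node and inside the Data Movement containers:

```console
cat /proc/sys/kernel/core_pattern
```
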
### Data Movement Debug Images

To make debugging symbols available, it is a good idea to use the `debug` versions of the two images used by the Data Movement containers:

- `nnf-mfu-debug`
- `nnf-dm-debug`

Both of these images include debugging symbols for [Open MPI](https://www.open-mpi.org/) and [mpiFileUtils](https://mpifileutils.readthedocs.io/en/v0.11.1/).

To use these images, edit the `environments/<system>/nnf-dm/kustomization.yaml` in your gitops repository and add the following:

```yaml
# Use images with mpifileutils/mpirun debug symbols
images:
  - name: ghcr.io/nearnodeflash/nnf-dm
    newName: ghcr.io/nearnodeflash/nnf-dm-debug
  - name: ghcr.io/nearnodeflash/nnf-mfu
    newName: ghcr.io/nearnodeflash/nnf-mfu-debug
```

This overrides the default images with the debug variants.
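
One generic way to confirm the override after argocd syncs (not a command from these docs) is to list the images in use and look for the debug variants:

```console
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | grep debug
```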

docs/guides/directive-breakdown/readme.md

Lines changed: 3 additions & 3 deletions
@@ -150,14 +150,14 @@ A location constraint consists of an `access` list and a `reference`.
* `status.compute.constraints.location.access.type` specifies the connection type for the storage. This can be `network` or `physical`
* `status.compute.constraints.location.access.priority` specifies how necessary the connection type is. This can be `mandatory` or `bestEffort`

-## RequiredDaemons
+## Requires

-The `status.requiredDaemons` section of the `DirectiveBreakdown` tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The `status.requiredDaemons` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.requiredDaemons` section is included below.
+The `status.requires` section of the `DirectiveBreakdown` tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The `status.requires` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.requires` section is included below.

```yaml
status:
  ...
-  requiredDaemons:
+  requires:
  - copy-offload
  ...
```
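
For context, a user typically causes this entry to appear by adding a `requires` option to the directive. A hedged example of a `jobdw` directive that would request the copy-offload daemon (the exact option syntax should be confirmed against the DW directive documentation for your release):

```
#DW jobdw type=xfs capacity=100GB name=example requires=copy-offload
```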

docs/guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@

* [Disable or Drain a Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
+* [Switch a Node From Worker to Master](node-management/worker-to-master.md)

## Monitoring the Cluster

docs/guides/node-management/worker-to-master.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Switch a Node From Worker to Master

In this example, htx[40-42] are worker nodes. We will remove htx[40-41] from the cluster and re-join them as master nodes.

## Remove a k8s worker node

Begin by moving the existing pods on htx[40-41] over to htx42.

Taint the nodes we're going to remove, to prevent new pods from being **scheduled** on them (this is different from the taint we'll use in a later step):

```console
NODE=htx40
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule
```

Set deploy/dws-webhook to 1 replica. **This must be done via the gitops repo.** Edit `environments/$ENV/dws/kustomization.yaml` and add the patch below, then wait for argocd to put it into effect, or force argocd to sync it with `argocd app sync 1-dws`.

```yaml
patches:
- target:
    kind: Deployment
    name: dws-webhook
  patch: |-
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dws-webhook
    spec:
      replicas: 1
```

Taint the nodes we're going to remove, to **bump existing pods** off them (this is different from the taint we used earlier). This will bump any DWS, NNF, ArgoCD, cert-manager, mpi-operator, and lustre-fs-operator pods. It leaves any lustre-csi-driver pods in place to assist with any Lustre unmounts that k8s may request.

```console
kubectl taint node $NODE cray.nnf.node.drain=true:NoExecute
```

Decommission the [calico node](https://docs.tigera.io/calico/latest/operations/decommissioning-a-node).

> If you are running the node controller or using the Kubernetes API datastore in policy-only mode, you do not need to manually decommission nodes.

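If manual decommissioning is needed, the command is roughly as follows — a sketch that assumes the calico plugin for kubectl (used again in the verification step below); `calicoctl delete node $NODE` is the equivalent standalone form:

```console
kubectl calico delete node $NODE
```
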
Tell k8s to drain the nodes.

Use the cray.nnf.node taints above before running `kubectl drain`. Those taints allow Workflows to be terminated cleanly, even when they have Lustre filesystems mounted in the pods on that node. It's important that the lustre-csi-driver pod on that node lives long enough to assist with those unmounts so that k8s can finish pod cleanup.

```console
kubectl drain --ignore-daemonsets $NODE
```

Delete the worker nodes:

```console
kubectl delete node $NODE
```

Verify that the node is deleted from calico and k8s:

```console
kubectl calico get nodes   # requires the calico plugin for kubectl
kubectl get nodes
```

Remove etcd, if it was a master:

```console
# on $NODE
kubeadm reset phase remove-etcd-member
```

It takes a while for all the containers on the deleted node to stop, so be patient.

```console
# on $NODE
crictl ps
```

Reset everything that `kubeadm join` did to that node:

```console
# on $NODE
kubeadm reset phase cleanup-node
```

## Join a node as a master

Check for expired `kubeadm init` or `kubeadm-certs` tokens, or expired certs.

The certificate-key from `kubeadm init` is deleted after two hours. Use `kubeadm init phase upload-certs --upload-certs` to reload the certs later. This is explained in the output of the `kubeadm init` command.

```console
kubeadm token list
```

The token labeled for "kubeadm init" is used as the token in `kubeadm join` commands. The one labeled for "managing TTL" controls the lifetime of the `kubeadm-certs` secret and the `bootstrap-token-XXX` secret. These secrets and this token are deleted after the "managing TTL" token expires. A worker can still join after that expires; a master cannot.

```console
kubeadm certs check-expiration
```

Re-join that node as a master. When you ran `kubeadm init` to create the initial master node, you should have saved the output. It contains the `join` command that you need to create new masters. You want the command line that includes the `--control-plane` option:

```console
# on $NODE
kubeadm join ... --control-plane ...
```

If that fails, it may tell you to generate new certs. Run the `kubeadm init phase` command it specifies, and note the certificate key in the output. Replace the certificate key from your original join command with this new key and run the new join command.
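
For reference, a control-plane join generally has the following shape; every value below is a placeholder that comes from your own cluster's `kubeadm init`/`upload-certs` output, not from this repository:

```console
# on $NODE -- placeholders only; use the join command saved from 'kubeadm init'
kubeadm join <control-plane-endpoint>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>
```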

docs/guides/storage-profiles/readme.md

Lines changed: 26 additions & 5 deletions
@@ -288,21 +288,29 @@ In general, `scale` gives a simple way for users to get a filesystem that has pe

## Command Line Variables

-### pvcreate
+### global
+- `$JOBID` - expands to the Job ID from the Workflow
+- `$USERID` - expands to the User ID of the user who submitted the job
+- `$GROUPID` - expands to the Group ID of the user who submitted the job
+
+### LVM PV commands

- `$DEVICE` - expands to the `/dev/<path>` value for one device that has been allocated

-### vgcreate
+### LVM VG commands

- `$VG_NAME` - expands to a volume group name that is controlled by Rabbit software.
- `$DEVICE_LIST` - expands to a list of space-separated `/dev/<path>` devices. This list will contain the devices that were iterated over for the pvcreate step.
+- `$DEVICE_NUM` - expands to the count of devices in `$DEVICE_LIST`

-### lvcreate
+### LVM LV Commands

- `$VG_NAME` - see vgcreate above.
- `$LV_NAME` - expands to a logical volume name that is controlled by Rabbit software.
- `$DEVICE_NUM` - expands to a number indicating the number of devices allocated for the volume group.
- `$DEVICE1, $DEVICE2, ..., $DEVICEn` - each expands to one of the devices from the `$DEVICE_LIST` above.
+- `$PERCENT_VG` - expands to the size that each LV should be based on a percentage of the total VG size
+- `$LV_SIZE` - expands to the size of the LV in kB in the format expected by `lvcreate`
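
To illustrate how these variables compose, the sketch below shows LVM command lines as they might appear in a storage profile. The field names and flag choices are illustrative only, not the exact `NnfStorageProfile` schema; use the type definition and samples referenced earlier in this guide for the real layout:

```yaml
# Illustrative only: field names are hypothetical; consult the NnfStorageProfile
# sample for the actual schema.
commandlines:
  pvCreate: pvcreate $DEVICE
  vgCreate: vgcreate $VG_NAME $DEVICE_LIST
  lvCreate: lvcreate --extents $PERCENT_VG --stripes $DEVICE_NUM --name $LV_NAME $VG_NAME
  mkfs: mkfs.xfs $DEVICE
```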

### XFS mkfs

@@ -326,9 +334,15 @@ In general, `scale` gives a simple way for users to get a filesystem that has pe

- `$FS_NAME` - expands to the filesystem name that was passed to Rabbit software from the workflow's #DW line.
- `$MGS_NID` - expands to the NID of the MGS. If the MGS was orchestrated by nnf-sos then an appropriate internal value will be used.
-- `$POOL_NAME` - see zpool create above.
-- `$VOL_NAME` - expands to the volume name that will be created. This value will be `<pool_name>/<dataset>`, and is controlled by Rabbit software.
+- `$ZVOL_NAME` - expands to the volume name that will be created. This value will be `<pool_name>/<dataset>`, and is controlled by Rabbit software.
- `$INDEX` - expands to the index value of the target and is controlled by Rabbit software.
+- `$TARGET_NAME` - expands to the name of the lustre target of the form `[fsname]-[target-type][index]` (e.g., `mylus-OST0003`)
+- `$BACKFS` - expands to the type of file system backing the Lustre target
+
+### Mount/Unmount
+
+- `$DEVICE` - expands to the device path to mount
+- `$MOUNT_PATH` - expands to the path to mount on

### PostMount/PreUnmount and PostActivate/PreDeactivate

@@ -343,3 +357,10 @@ These variables are for lustre only and can be used to perform PostMount activit
- `$NUM_MGTMDTS` - expands to the number of combined MGTMDTs for the lustre filesystem
- `$NUM_OSTS` - expands to the number of OSTs for the lustre filesystem
- `$NUM_NNFNODES` - expands to the number of NNF Nodes for the lustre filesystem
+
+### NnfSystemStorage specific
+
+- `$COMPUTE_HOSTNAME` - expands to the hostname of the compute node that will use the allocation. This can be used to add a tag during the lvcreate
+
+```
+lvCreate --zero n --activate n --extents $PERCENT_VG --addtag $COMPUTE_HOSTNAME ...
+```
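
To confirm that the tag was applied, LVM can report tags per logical volume, for example:

```console
lvs -o lv_name,lv_tags $VG_NAME
```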

docs/guides/user-containers/readme.md

Lines changed: 5 additions & 5 deletions
@@ -52,9 +52,9 @@ The next few subsections provide an overview of the primary components comprisin
aspects, they don't encompass every single detail. For an in-depth understanding of the capabilities
offered by container profiles, we recommend referring to the following resources:

-- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha1/nnfcontainerprofile_types.go#L35) for `NnfContainerProfile`
-- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha1_nnfcontainerprofile.yaml) for `NnfContainerProfile`
-- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfcontainerprofiles.yaml) for `NnfContainerProfile` (same as `kubectl get` above)
+- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha6/nnfcontainerprofile_types.go#L35) for `NnfContainerProfile`
+- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha6_nnfcontainerprofile.yaml) for `NnfContainerProfile`
+- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_nnfcontainerprofiles.yaml) for `NnfContainerProfile` (same as `kubectl get` above)

#### Container Storages

@@ -597,7 +597,7 @@ The following profile shows the placement of the `readonly-red-rock-slushy` secr
in the previous step, and points to the user's `dean/red-rock-slushy:v1.0` container.

```yaml
-apiVersion: nnf.cray.hpe.com/v1alpha1
+apiVersion: nnf.cray.hpe.com/v1alpha6
kind: NnfContainerProfile
metadata:
  name: red-rock-slushy
@@ -635,7 +635,7 @@ insert two `imagePullSecrets` lists into the `mpiSpec` of the NnfContainerProfil
launcher and the MPI worker.

```yaml
-apiVersion: nnf.cray.hpe.com/v1alpha1
+apiVersion: nnf.cray.hpe.com/v1alpha6
kind: NnfContainerProfile
metadata:
  name: mpi-red-rock-slushy
