
Commit 2d15105

Merge release v0.1.15
Release v0.1.15
2 parents 85fab38 + 9a1aba7 commit 2d15105

14 files changed: +724 −41 lines changed

docs/guides/data-movement/readme.md

Lines changed: 88 additions & 11 deletions
@@ -3,9 +3,7 @@ authors: Blake Devcich <[email protected]>
categories: provisioning
---

-# Data Movement Overview
-
-## Configuration
+# Data Movement Configuration

Data Movement can be configured in multiple ways:

@@ -17,7 +15,7 @@ particular `NnfDataMovementProfile` (or the default). The second is done per the
which allows for some configuration on a per-case basis, but is limited in scope. Both methods are
meant to work in tandem.

-### Data Movement Profiles
+## Data Movement Profiles

The server side configuration is controlled by creating `NnfDataMovementProfiles` resources in
Kubernetes. These work similar to `NnfStorageProfiles`. See [here](../storage-profiles/readme.md)
@@ -26,11 +24,11 @@ for understanding how to use profiles, set a default, etc.
For an in-depth understanding of the capabilities offered by Data Movement profiles, we recommend
referring to the following resources:

-- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha1/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
-- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha1_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
-- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
+- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha6/nnfdatamovementprofile_types.go#L27) for `NnfDataMovementProfile`
+- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha6_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`
+- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_nnfdatamovementprofile.yaml) for `NnfDataMovementProfile`

-### Copy Offload API Daemon
+## Copy Offload API Daemon

The `CreateRequest` API call that is used to create Data Movement with the Copy Offload API has some
options to allow a user to specify some options for that particular Data Movement operation. These
@@ -40,14 +38,14 @@ settings are on a per-request basis. These supplement the configuration in the
The Copy Offload API requires the `nnf-dm` daemon to be running on the compute node. This daemon may
be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run
it only when a user requests it. See [Compute Daemons](../compute-daemons/readme.md) for the systemd
-service configuration of the daemon. See `RequiredDaemons` in [Directive
+service configuration of the daemon. See `Requires` in [Directive
Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the
daemon in the case where the WLM will run it only on demand.

See the [DataMovementCreateRequest API](copy-offload-api.html#datamovement.DataMovementCreateRequest)
definition for what can be configured.

-### SELinux and Data Movement
+## SELinux and Data Movement

Careful consideration must be taken when enabling SELinux on compute nodes. Doing so will result in
SELinux Extended File Attributes (xattrs) being placed on files created by applications running on
@@ -62,7 +60,7 @@ option.
See the [`dcp` documentation](https://mpifileutils.readthedocs.io/en/latest/dcp.1.html) for more
information.

-### `sshd` Configuration for Data Movement Workers
+## `sshd` Configuration for Data Movement Workers

The `nnf-dm-worker-*` pods run `sshd` in order to listen for `mpirun` jobs to perform data movement.
The number of simultaneous connections is limited via the sshd configuration (i.e. `MaxStartups`).
@@ -72,3 +70,82 @@ start rejecting connections once the limit is reached.

The `sshd_config` is stored in the `nnf-dm-worker-config` `ConfigMap` so that it can be changed on
a running system without needing to roll new images. This also enables site-specific configuration.
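
For example, to raise the connection limit you can edit `MaxStartups` in that `ConfigMap` directly. A minimal sketch follows; the `nnf-dm-system` namespace and the example value are assumptions to verify against your own deployment, not values taken from this commit:

```console
# Namespace is an assumption; verify where the nnf-dm worker pods run on your system.
kubectl edit configmap -n nnf-dm-system nnf-dm-worker-config
```

Within the embedded `sshd_config`, `MaxStartups` uses sshd's `start:rate:full` form (for example, `MaxStartups 100:30:200`); choose values that match the number of simultaneous `mpirun` connections you expect.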

## Enabling Core Dumps

### Mounting Core Dump Volumes

First, you must determine how your nodes handle core dumps. For example, if `systemd-coredump` is
used, then core dumps inside containers are moved to the host node automatically. If that is not
the case, then a directory on the host nodes will need to be mounted into the Data Movement
containers. This directory will contain any core dumps collected by data movement operations, mainly
`mpirun` or `dcp`.

For Data Movement, the pods run on two types of Kubernetes nodes:

- `nnf-dm-worker` pods on Rabbit nodes
- `nnf-dm-controller` pods on Kubernetes worker nodes

For all of these nodes, a core dump directory needs to be present and consistent across the nodes.
Once it is in place, we can edit the Kubernetes configuration to mount this directory from the host
node into the containers using a [`hostPath`
Volume](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath).

This configuration is added via the gitops repository for the system. Patches are used to patch the
`nnf-dm` containers to mount the core dump directory via a `hostPath` volume.

An example of this configuration is provided in
[`argocd-boilerplate`](https://github.com/NearNodeFlash/argocd-boilerplate/tree/main/environments/example-env/nnf-dm).
There are two patch files that add Volumes to mount `/localdisk/dumps` from the host node at the
same location inside the containers.

- [`dm-controller-coredumps.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/dm-controller-coredumps.yaml)
- [`dm-manager-coredumps.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/dm-manager-coredumps.yaml)

[`kustomization.yaml`](https://github.com/NearNodeFlash/argocd-boilerplate/blob/main/environments/example-env/nnf-dm/kustomization.yaml#L13C1-L24C29)
then applies these patches to the correct resources.
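
For reference, the general shape of such a patch is sketched below. This is illustrative only — the container name is a placeholder — so treat the linked boilerplate files as the authoritative versions:

```yaml
# Illustrative sketch: names below are placeholders; see the argocd-boilerplate
# patch files linked above for the real targets.
spec:
  template:
    spec:
      containers:
        - name: manager                 # placeholder container name
          volumeMounts:
            - name: coredumps
              mountPath: /localdisk/dumps
      volumes:
        - name: coredumps
          hostPath:
            path: /localdisk/dumps
            type: DirectoryOrCreate
```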

### Editing the Data Movement Command

Once the volume is in place, the Data Movement command must be updated to first `cd` into this
directory. This ensures that the core dump is placed in that directory, making it accessible on the
host node.

To achieve this, update the Data Movement profiles in your gitops repository to prepend
`cd /localdisk/dumps && ` to the `command`. For example, the default profile in
`environments/<system>/nnf-sos/default-nnfdatamovementprofile.yaml` would look like the following:

```yaml
kind: NnfDataMovementProfile
metadata:
  name: default
  namespace: nnf-system
data:
  command: ulimit -n 2048 && cd /localdisk/dumps && mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST
```

Note that core patterns for containers are inherited from the host and that Linux containers do not
support a container-only core pattern without also affecting the host node. This is why we must use
a preceding `cd <dir>` in the Data Movement command.

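To see where a dump will land, you can inspect the kernel's core pattern; because the setting is shared with the host, the same value is reported on the host node and inside the Data Movement containers:

```console
cat /proc/sys/kernel/core_pattern
```
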
### Data Movement Debug Images

To make debugging symbols available, it is a good idea to use the `debug` versions of the two images used by the Data Movement containers:

- `nnf-mfu-debug`
- `nnf-dm-debug`

Both of these images include debugging symbols for [Open MPI](https://www.open-mpi.org/) and [mpiFileUtils](https://mpifileutils.readthedocs.io/en/v0.11.1/).

To use these images, edit the `environments/<system>/nnf-dm/kustomization.yaml` in your gitops repository and add the following:

```yaml
# Use images with mpifileutils/mpirun debug symbols
images:
  - name: ghcr.io/nearnodeflash/nnf-dm
    newName: ghcr.io/nearnodeflash/nnf-dm-debug
  - name: ghcr.io/nearnodeflash/nnf-mfu
    newName: ghcr.io/nearnodeflash/nnf-mfu-debug
```

This overrides the default images with the debug variants.
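
One generic way to confirm the override after argocd syncs (not a command from these docs) is to list the images in use and look for the debug variants:

```console
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | grep debug
```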

docs/guides/directive-breakdown/readme.md

Lines changed: 3 additions & 3 deletions
@@ -150,14 +150,14 @@ A location constraint consists of an `access` list and a `reference`.
* `status.compute.constraints.location.access.type` specifies the connection type for the storage. This can be `network` or `physical`
* `status.compute.constraints.location.access.priority` specifies how necessary the connection type is. This can be `mandatory` or `bestEffort`

-## RequiredDaemons
+## Requires

-The `status.requiredDaemons` section of the `DirectiveBreakdown` tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The `status.requiredDaemons` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.requiredDaemons` section is included below.
+The `status.requires` section of the `DirectiveBreakdown` tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The `status.requires` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.requires` section is included below.

```yaml
status:
  ...
-  requiredDaemons:
+  requires:
  - copy-offload
  ...
```
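
For context, a user typically causes this entry to appear by adding a `requires` option to the directive. A hedged example of a `jobdw` directive that would request the copy-offload daemon (the exact option syntax should be confirmed against the DW directive documentation for your release):

```
#DW jobdw type=xfs capacity=100GB name=example requires=copy-offload
```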

docs/guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@

* [Disable or Drain a Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
+* [Switch a Node From Worker to Master](node-management/worker-to-master.md)

## Monitoring the Cluster

docs/guides/node-management/worker-to-master.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Switch a Node From Worker to Master

In this example, htx[40-42] are worker nodes. We will remove htx[40-41] from the cluster and re-join them as master nodes.

## Remove a k8s worker node

Begin by moving the existing pods on htx[40-41] over to htx42.

Taint the nodes we're going to remove, to prevent new pods from being **scheduled** on them (this is different from the taint we'll use in a later step):

```console
NODE=htx40
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule
```

Set deploy/dws-webhook to 1 replica. **This must be done via the gitops repo.** Edit `environments/$ENV/dws/kustomization.yaml` and add the patch below, then wait for argocd to put it into effect, or force argocd to sync it with `argocd app sync 1-dws`.

```yaml
patches:
- target:
    kind: Deployment
    name: dws-webhook
  patch: |-
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dws-webhook
    spec:
      replicas: 1
```

Taint the nodes we're going to remove, to **bump existing pods** off them (this is different from the taint we used earlier). This will bump any DWS, NNF, ArgoCD, cert-manager, mpi-operator, and lustre-fs-operator pods. It leaves any lustre-csi-driver pods in place to assist with any Lustre unmounts that k8s may request.

```console
kubectl taint node $NODE cray.nnf.node.drain=true:NoExecute
```

Decommission the [calico node](https://docs.tigera.io/calico/latest/operations/decommissioning-a-node).

> If you are running the node controller or using the Kubernetes API datastore in policy-only mode, you do not need to manually decommission nodes.

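If manual decommissioning is needed, the command is roughly as follows — a sketch that assumes the calico plugin for kubectl (used again in the verification step below); `calicoctl delete node $NODE` is the equivalent standalone form:

```console
kubectl calico delete node $NODE
```
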
Tell k8s to drain the nodes.

Use the cray.nnf.node taints above before running `kubectl drain`. Those taints allow Workflows to be terminated cleanly, even when they have Lustre filesystems mounted in the pods on that node. It's important that the lustre-csi-driver pod on that node lives long enough to assist with those unmounts so that k8s can finish pod cleanup.

```console
kubectl drain --ignore-daemonsets $NODE
```

Delete the worker nodes:

```console
kubectl delete node $NODE
```

Verify that the node is deleted from calico and k8s:

```console
kubectl calico get nodes   # requires the calico plugin for kubectl
kubectl get nodes
```

Remove etcd, if it was a master:

```console
# on $NODE
kubeadm reset phase remove-etcd-member
```

It takes a while for all the containers on the deleted node to stop, so be patient.

```console
# on $NODE
crictl ps
```

Reset everything that `kubeadm join` did to that node:

```console
# on $NODE
kubeadm reset phase cleanup-node
```

## Join a node as a master

Check for expired `kubeadm init` or `kubeadm-certs` tokens, or expired certs.

The certificate-key from `kubeadm init` is deleted after two hours. Use `kubeadm init phase upload-certs --upload-certs` to reload the certs later. This is explained in the output of the `kubeadm init` command.

```console
kubeadm token list
```

The token labeled for "kubeadm init" is used as the token in `kubeadm join` commands. The one labeled for "managing TTL" controls the lifetime of the `kubeadm-certs` secret and the `bootstrap-token-XXX` secret. These secrets and this token are deleted after the "managing TTL" token expires. A worker can still join after that expires; a master cannot.

```console
kubeadm certs check-expiration
```

Re-join that node as a master. When you ran `kubeadm init` to create the initial master node, you should have saved the output. It contains the `join` command that you need to create new masters. You want the command line that includes the `--control-plane` option:

```console
# on $NODE
kubeadm join ... --control-plane ...
```

If that fails, it may tell you to generate new certs. Run the `kubeadm init phase` command it specifies, and note the certificate key in the output. Replace the certificate key from your original join command with this new key and run the new join command.
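
For reference, a control-plane join generally has the following shape; every value below is a placeholder that comes from your own cluster's `kubeadm init`/`upload-certs` output, not from this repository:

```console
# on $NODE -- placeholders only; use the join command saved from 'kubeadm init'
kubeadm join <control-plane-endpoint>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>
```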

docs/guides/storage-profiles/readme.md

Lines changed: 26 additions & 5 deletions
@@ -288,21 +288,29 @@ In general, `scale` gives a simple way for users to get a filesystem that has pe

## Command Line Variables

-### pvcreate
+### global
+- `$JOBID` - expands to the Job ID from the Workflow
+- `$USERID` - expands to the User ID of the user who submitted the job
+- `$GROUPID` - expands to the Group ID of the user who submitted the job
+
+### LVM PV commands

- `$DEVICE` - expands to the `/dev/<path>` value for one device that has been allocated

-### vgcreate
+### LVM VG commands

- `$VG_NAME` - expands to a volume group name that is controlled by Rabbit software.
- `$DEVICE_LIST` - expands to a list of space-separated `/dev/<path>` devices. This list will contain the devices that were iterated over for the pvcreate step.
+- `$DEVICE_NUM` - expands to the count of devices in `$DEVICE_LIST`

-### lvcreate
+### LVM LV Commands

- `$VG_NAME` - see vgcreate above.
- `$LV_NAME` - expands to a logical volume name that is controlled by Rabbit software.
- `$DEVICE_NUM` - expands to a number indicating the number of devices allocated for the volume group.
- `$DEVICE1, $DEVICE2, ..., $DEVICEn` - each expands to one of the devices from the `$DEVICE_LIST` above.
+- `$PERCENT_VG` - expands to the size that each LV should be based on a percentage of the total VG size
+- `$LV_SIZE` - expands to the size of the LV in kB in the format expected by `lvcreate`
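
To illustrate how these variables compose, the sketch below shows LVM command lines as they might appear in a storage profile. The field names and flag choices are illustrative only, not the exact `NnfStorageProfile` schema; use the type definition and samples referenced earlier in this guide for the real layout:

```yaml
# Illustrative only: field names are hypothetical; consult the NnfStorageProfile
# sample for the actual schema.
commandlines:
  pvCreate: pvcreate $DEVICE
  vgCreate: vgcreate $VG_NAME $DEVICE_LIST
  lvCreate: lvcreate --extents $PERCENT_VG --stripes $DEVICE_NUM --name $LV_NAME $VG_NAME
  mkfs: mkfs.xfs $DEVICE
```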

### XFS mkfs

@@ -326,9 +334,15 @@ In general, `scale` gives a simple way for users to get a filesystem that has pe

- `$FS_NAME` - expands to the filesystem name that was passed to Rabbit software from the workflow's #DW line.
- `$MGS_NID` - expands to the NID of the MGS. If the MGS was orchestrated by nnf-sos then an appropriate internal value will be used.
-- `$POOL_NAME` - see zpool create above.
-- `$VOL_NAME` - expands to the volume name that will be created. This value will be `<pool_name>/<dataset>`, and is controlled by Rabbit software.
+- `$ZVOL_NAME` - expands to the volume name that will be created. This value will be `<pool_name>/<dataset>`, and is controlled by Rabbit software.
- `$INDEX` - expands to the index value of the target and is controlled by Rabbit software.
+- `$TARGET_NAME` - expands to the name of the lustre target of the form `[fsname]-[target-type][index]` (e.g., `mylus-OST0003`)
+- `$BACKFS` - expands to the type of file system backing the Lustre target
+
+### Mount/Unmount
+
+- `$DEVICE` - expands to the device path to mount
+- `$MOUNT_PATH` - expands to the path to mount on

### PostMount/PreUnmount and PostActivate/PreDeactivate

@@ -343,3 +357,10 @@ These variables are for lustre only and can be used to perform PostMount activit
- `$NUM_MGTMDTS` - expands to the number of combined MGTMDTs for the lustre filesystem
- `$NUM_OSTS` - expands to the number of OSTs for the lustre filesystem
- `$NUM_NNFNODES` - expands to the number of NNF Nodes for the lustre filesystem
+
+### NnfSystemStorage specific
+
+- `$COMPUTE_HOSTNAME` - expands to the hostname of the compute node that will use the allocation. This can be used to add a tag during the lvcreate
+
+```
+lvCreate --zero n --activate n --extents $PERCENT_VG --addtag $COMPUTE_HOSTNAME ...
+```
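
To confirm that the tag was applied, LVM can report tags per logical volume, for example:

```console
lvs -o lv_name,lv_tags $VG_NAME
```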

docs/guides/user-containers/readme.md

Lines changed: 5 additions & 5 deletions
@@ -52,9 +52,9 @@ The next few subsections provide an overview of the primary components comprisin
aspects, they don't encompass every single detail. For an in-depth understanding of the capabilities
offered by container profiles, we recommend referring to the following resources:

-- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha1/nnfcontainerprofile_types.go#L35) for `NnfContainerProfile`
-- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha1_nnfcontainerprofile.yaml) for `NnfContainerProfile`
-- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfcontainerprofiles.yaml) for `NnfContainerProfile` (same as `kubectl get` above)
+- [Type definition](https://github.com/NearNodeFlash/nnf-sos/blob/master/api/v1alpha6/nnfcontainerprofile_types.go#L35) for `NnfContainerProfile`
+- [Sample](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/samples/nnf_v1alpha6_nnfcontainerprofile.yaml) for `NnfContainerProfile`
+- [Online Examples](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_nnfcontainerprofiles.yaml) for `NnfContainerProfile` (same as `kubectl get` above)

#### Container Storages

@@ -597,7 +597,7 @@ The following profile shows the placement of the `readonly-red-rock-slushy` secr
in the previous step, and points to the user's `dean/red-rock-slushy:v1.0` container.

```yaml
-apiVersion: nnf.cray.hpe.com/v1alpha1
+apiVersion: nnf.cray.hpe.com/v1alpha6
kind: NnfContainerProfile
metadata:
  name: red-rock-slushy
@@ -635,7 +635,7 @@ insert two `imagePullSecrets` lists into the `mpiSpec` of the NnfContainerProfil
launcher and the MPI worker.

```yaml
-apiVersion: nnf.cray.hpe.com/v1alpha1
+apiVersion: nnf.cray.hpe.com/v1alpha6
kind: NnfContainerProfile
metadata:
  name: mpi-red-rock-slushy
