
Commit a01af37

Merge pull request #156 from NearNodeFlash/release-v0.1.2
Release v0.1.2
2 parents 98774d8 + 0afecc0

9 files changed: +199 lines, -11 lines


.github/workflows/publish-main.yaml

Lines changed: 3 additions & 4 deletions
@@ -1,8 +1,6 @@
 name: Publish `main` Documentation
-on:
-  push:
-    branches:
-      - main
+
+on: [push]

 jobs:
   build:
@@ -34,3 +32,4 @@ jobs:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
           mike deploy --push dev
+

docs/guides/global-lustre/readme.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
authors: Blake Devcich <[email protected]>
categories: provisioning
---

# Global Lustre

## Background

Adding global lustre to Rabbit systems allows access to external file systems. This is primarily
used for Data Movement, where a user can perform `copy_in` and `copy_out` directives with global
lustre being the source and destination, respectively.

Global lustre file systems are represented by the `lustrefilesystems` resource in Kubernetes:

```shell
$ kubectl get lustrefilesystems -A
NAMESPACE   NAME       FSNAME     MGSNIDS          AGE
default     mylustre   mylustre   10.1.1.113@tcp   20d
```

An example resource is as follows:

```yaml
apiVersion: lus.cray.hpe.com/v1beta1
kind: LustreFileSystem
metadata:
  name: mylustre
  namespace: default
spec:
  mgsNids: 10.1.1.100@tcp
  mountRoot: /p/mylustre
  name: mylustre
  namespaces:
    default:
      modes:
      - ReadWriteMany
```

## Namespaces

Note the `spec.namespaces` field. For each namespace listed, the `lustre-fs-operator` creates a
PV/PVC pair in that namespace, which allows pods in that namespace to access global lustre. The
`default` namespace should appear in this list; this makes the `lustrefilesystem` resource
available to containers (e.g. container workflows) running in the `default` namespace.

The `nnf-dm-system` namespace is added automatically; there is no need to specify it manually here.
The NNF Data Movement Manager is responsible for ensuring that `nnf-dm-system` is in
`spec.namespaces`. This ensures that the NNF DM Worker pods have global lustre mounted as long
as `nnf-dm` is deployed. **To unmount global lustre from the NNF DM Worker pods, the
`lustrefilesystem` resource must be deleted.**

The `lustrefilesystem` resource itself should be created in the `default` namespace (i.e.
`metadata.namespace`).
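As an illustrative aside (not part of the committed guide), a namespace entry could be added to an existing resource with a merge patch along the following lines. The namespace name `mynamespace` is hypothetical; `mylustre` comes from the example above.

```shell
# Hypothetical example: grant pods in "mynamespace" access to this global lustre file system
kubectl patch lustrefilesystems mylustre -n default --type merge \
  -p '{"spec":{"namespaces":{"mynamespace":{"modes":["ReadWriteMany"]}}}}'
```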
## NNF Data Movement Manager

The NNF Data Movement Manager is responsible for monitoring `lustrefilesystem` resources to mount
(or unmount) the global lustre filesystem in each of the NNF DM Worker pods. These pods run on each
of the NNF nodes. This means that with each addition or removal of a `lustrefilesystems` resource,
the DM Worker pods restart to adjust their mount points.

The NNF Data Movement Manager also places a finalizer on the `lustrefilesystem` resource to indicate
that the resource is in use by Data Movement. This prevents the PV/PVC from being deleted while they
are being used by pods.

## Adding Global Lustre

As mentioned previously, the NNF Data Movement Manager monitors these resources and automatically
adds the `nnf-dm-system` namespace to all `lustrefilesystem` resources. Once this happens, a PV/PVC
is created for the `nnf-dm-system` namespace to access global lustre. The Manager updates the NNF DM
Worker pods, which are then restarted to mount the global lustre file system.
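As an illustrative aside (not part of the committed guide), one way to confirm the result is to check for the PV/PVC in `nnf-dm-system` and watch the DM Worker pods restart; this is only a sketch of how an administrator might verify the mount.

```shell
# Confirm that a PVC for the global lustre file system now exists in nnf-dm-system
kubectl get pvc -n nnf-dm-system

# Watch the NNF DM Worker pods restart to pick up the new mount point
kubectl get pods -n nnf-dm-system -w
```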
## Removing Global Lustre

When a `lustrefilesystem` is deleted, the NNF DM Manager takes notice and starts to unmount the file
system from the DM Worker pods, causing another restart of the DM Worker pods. Once this is
finished, the DM finalizer is removed from the `lustrefilesystem` resource to signal that it is no
longer in use by Data Movement.

If a `lustrefilesystem` does not delete, check the finalizers to see what might still be using it.
It is possible to get into a situation where `nnf-dm` has been undeployed, so there is nothing to
remove the DM finalizer from the `lustrefilesystem` resource. If that is the case, then manually
remove the DM finalizer so the deletion of the `lustrefilesystem` resource can continue.
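A minimal sketch of that manual step, not part of the committed guide: inspect the finalizers with `jsonpath`, then remove the stale Data Movement entry with `kubectl edit` (or an equivalent patch). The resource name `mylustre` comes from the earlier example.

```shell
# Inspect the finalizers on the stuck resource
kubectl get lustrefilesystems mylustre -n default -o jsonpath='{.metadata.finalizers}{"\n"}'

# Edit the resource and delete the Data Movement finalizer entry from metadata.finalizers
kubectl edit lustrefilesystems mylustre -n default
```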

docs/guides/index.md

Lines changed: 3 additions & 0 deletions
@@ -14,9 +14,12 @@
 * [Data Movement Configuration](data-movement/readme.md)
 * [Copy Offload API](data-movement/copy-offload-api.html)
 * [Lustre External MGT](external-mgs/readme.md)
+* [Global Lustre](global-lustre/readme.md)
 
 ## NNF User Containers
 
 * [User Containers](user-containers/readme.md)
 
+## Node Management
 
+* [Draining A Node](node-management/drain.md)

docs/guides/node-management/drain.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Draining A Node

The NNF software consists of a collection of DaemonSets and Deployments. The pods
on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain`
command is not able to remove the NNF software from a node. See [Safely Drain a Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for details about
the limitations posed by DaemonSet pods.

Given the limitations of DaemonSets, the NNF software will be drained by using taints,
as described in
[Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Drain NNF Pods From A Rabbit Node

Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.
The CSI driver pods will remain on the node to satisfy any unmount requests from k8s
as it cleans up the NNF pods.

```shell
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
```
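As an illustrative aside (not part of the committed page), after applying the taint the pods still running on the node can be listed with a field selector, to confirm that only the expected pods (such as the CSI driver) remain:

```shell
# List the pods still running on the tainted node
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
```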
To restore the node to service, remove the `cray.nnf.node.drain` taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain-
```

## The CSI Driver

While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.

**Warning** K8s relies on the CSI driver to unmount any filesystems that may have
been mounted into a pod's namespace. If it is not present when k8s is attempting
to remove a pod, then the pod may be left in the "Terminating" state. This is most
obvious when draining the `nnf-dm-worker` pods, which usually have filesystems
mounted in them.
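As an illustrative aside (not part of the committed page), pods stuck in that state can be spotted by filtering the pod list for the node:

```shell
# Look for pods stuck in Terminating on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE | grep Terminating
```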
Drain the CSI driver pod from a node by applying the `cray.nnf.node.drain.csi`
taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain.csi=true:NoSchedule cray.nnf.node.drain.csi=true:NoExecute
```

To restore the CSI driver pods to that node, remove the `cray.nnf.node.drain.csi` taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain.csi-
```

This taint will also drain the remaining NNF software if it has not already been
drained by the `cray.nnf.node.drain` taint.

docs/guides/rbac-for-users/readme.md

Lines changed: 4 additions & 4 deletions
@@ -125,15 +125,15 @@ Generate a key and certificate for our "flux" user, similar to the way we create
 
 After the keys have been generated, a new kubeconfig file can be created for the "flux" user, similar to the one for the "hpe" user above. Again, substitute "flux" in place of "hpe".
 
-### Apply the provided ClusterRole and create a ClusterRoleBinding
+### Use the provided ClusterRole and create a ClusterRoleBinding
 
-DataWorkflowServices has already defined the role to be used with WLMs. Simply apply the `workload-manager` ClusterRole from DataWorkflowServices to the system:
+DataWorkflowServices has already defined the role to be used with WLMs, named `dws-workload-manager`:
 
 ```console
-kubectl apply -f https://github.com/DataWorkflowServices/dws/raw/master/config/rbac/workload_manager_role.yaml
+kubectl get clusterrole dws-workload-manager
 ```
 
-Create and apply a ClusterRoleBinding to associate the "flux" user with the `workload-manager` ClusterRole:
+Create and apply a ClusterRoleBinding to associate the "flux" user with the `dws-workload-manager` ClusterRole:
 
 ClusterRoleBinding
 ```yaml

docs/guides/storage-profiles/readme.md

Lines changed: 46 additions & 0 deletions
@@ -45,6 +45,52 @@ To clear the default flag on a profile
 $ kubectl patch nnfstorageprofile durable -n nnf-system --type merge -p '{"data":{"default":false}}'
 ```
 
+# Creating The Initial Default Profile
+
+Create the initial default profile from scratch or by using the [NnfStorageProfile/template](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfstorageprofile.yaml) resource as a template. If `nnf-deploy` was used to install nnf-sos, then the default profile described below will have been created automatically.
+
+To use the `template` resource, begin by obtaining a copy of it either from the nnf-sos repo or from a live system. To get it from a live system, use the following command:
+
+```shell
+kubectl get nnfstorageprofile -n nnf-system template -o yaml > profile.yaml
+```
+
+Edit the `profile.yaml` file to trim the metadata section so that it contains only a name and namespace. The namespace must be left as nnf-system, but the name should be set to signify that this is the new default profile. In this example we will name it `default`. The metadata section will look like the following, and will contain no other fields:
+
+```yaml
+metadata:
+  name: default
+  namespace: nnf-system
+```
+
+Mark this new profile as the default profile by setting `default: true` in the data section of the resource:
+
+```yaml
+data:
+  default: true
+```
+
+Apply this resource to the system and verify that it is the only one marked as the default resource:
+
+```shell
+kubectl get nnfstorageprofile -A
+```
+
+The output will appear similar to the following:
+
+```shell
+NAMESPACE    NAME       DEFAULT   AGE
+nnf-system   default    true      9s
+nnf-system   template   false     11s
+```
+
+The administrator should edit the `default` profile to record any cluster-specific settings.
+Maintain a copy of this resource YAML in a safe place so it isn't lost across upgrades.
+
+## Keeping The Default Profile Updated
+
+An upgrade of nnf-sos may include updates to the `template` profile. It may be necessary to manually copy these updates into the `default` profile.
+
 # Profile Parameters
 
 ## XFS
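An illustrative aside on the "Creating The Initial Default Profile" steps above, not part of the committed guide: applying the edited `profile.yaml`, and, after a later nnf-sos upgrade, comparing the `default` profile against the updated `template`. This is only a sketch of one possible workflow.

```shell
# Apply the edited profile saved earlier as profile.yaml
kubectl apply -f profile.yaml

# After an upgrade, compare the default profile against the updated template
diff <(kubectl get nnfstorageprofile -n nnf-system template -o yaml) \
     <(kubectl get nnfstorageprofile -n nnf-system default -o yaml)
```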

docs/repo-guides/release-nnf-sw/readme.md

Lines changed: 3 additions & 2 deletions
@@ -35,7 +35,7 @@ need to be released separately.
 ## Primer
 
 This document is based on the process set forth by the [DataWorkflowServices Release
-Process](https://dataworkflowservices.github.io/v0.0.2/repo-guides/create-a-release/readme/).
+Process](https://dataworkflowservices.github.io/latest/repo-guides/create-a-release/readme/).
 Please read that as a background for this document before going any further.
 
 ## Requirements
@@ -93,7 +93,7 @@ just an example.
 
 |Repo |Update|
 |---------------------|------|
-|`nnf-mfu` |The new version of `nnf-mfu` is referenced by the `NNFMFU` variable in several places:<br><br>`nnf-sos`<br>1. `Makefile` replace `NNFMFU` with `nnf-mfu's` tag.<br><br>`nnf-dm`<br>1. In `Dockerfile` and `Makefile`, replace `NNFMU_VERSION` with the new version.<br>2. In `config/manager/kustomization.yaml`, replace `nnf-mfu`'s `newTag: <X.Y.Z>.`|
+|`nnf-mfu` |The new version of `nnf-mfu` is referenced by the `NNFMFU` variable in several places:<br><br>`nnf-sos`<br>1. `Makefile` replace `NNFMFU` with `nnf-mfu's` tag.<br><br>`nnf-dm`<br>1. In `Dockerfile` and `Makefile`, replace `NNFMU_VERSION` with the new version.<br>2. In `config/manager/kustomization.yaml`, replace `nnf-mfu`'s `newTag: <X.Y.Z>.`<br><br>`nnf-deploy`<br>1. In `config/repositories.yaml` replace `NNFMFU_VERSION` with the new version.|
 |`lustre-fs-operator` |update `config/manager/kustomization.yaml` with the correct version.|
 |`dws` |update `config/manager/kustomization.yaml` with the correct version.|
 |`nnf-sos` |update `config/manager/kustomization.yaml` with the correct version.|
@@ -183,6 +183,7 @@ that everything is current on `master` for `nnf-deploy`.
 
 12. Follow steps 6-7 from the previous section to finalize the release of `nnf-deploy`.
 
+**Please review documentation for changes you may have made.**
 **The software is now released!**
 
 ## Clone a release

external/nnf-dm

Submodule nnf-dm updated 187 files

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ nav:
     - 'Storage Profiles': 'guides/storage-profiles/readme.md'
     - 'User Containers': 'guides/user-containers/readme.md'
     - 'Lustre External MGT': 'guides/external-mgs/readme.md'
+    - 'Global Lustre': 'guides/global-lustre/readme.md'
+    - 'Draining A Node': 'guides/node-management/drain.md'
   - 'RFCs':
     - rfcs/index.md
     - 'Rabbit Request For Comment Process': 'rfcs/0001/readme.md'
