
Commit a01af37

Merge pull request #156 from NearNodeFlash/release-v0.1.2
Release v0.1.2
2 parents 98774d8 + 0afecc0

9 files changed: +199 lines, -11 lines


.github/workflows/publish-main.yaml

Lines changed: 3 additions & 4 deletions
@@ -1,8 +1,6 @@
 name: Publish `main` Documentation
-on:
-  push:
-    branches:
-      - main
+
+on: [push]

 jobs:
   build:
@@ -34,3 +32,4 @@ jobs:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
           mike deploy --push dev
+

docs/guides/global-lustre/readme.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
authors: Blake Devcich <[email protected]>
categories: provisioning
---

# Global Lustre

## Background

Adding global lustre to Rabbit systems allows access to external file systems. This is primarily
used for Data Movement, where a user can perform `copy_in` and `copy_out` directives with global
lustre being the source and destination, respectively.

Global lustre file systems are represented by the `lustrefilesystems` resource in Kubernetes:

```shell
$ kubectl get lustrefilesystems -A
NAMESPACE   NAME       FSNAME     MGSNIDS          AGE
default     mylustre   mylustre   10.1.1.113@tcp   20d
```

An example resource is as follows:

```yaml
apiVersion: lus.cray.hpe.com/v1beta1
kind: LustreFileSystem
metadata:
  name: mylustre
  namespace: default
spec:
  mgsNids: 10.1.1.100@tcp
  mountRoot: /p/mylustre
  name: mylustre
  namespaces:
    default:
      modes:
      - ReadWriteMany
```

## Namespaces

Note the `spec.namespaces` field. For each namespace listed, the `lustre-fs-operator` creates a
PV/PVC pair in that namespace, which allows pods in that namespace to access global lustre. The
`default` namespace should appear in this list; this makes the `lustrefilesystem` resource
available to containers (e.g. container workflows) running in the `default` namespace.

The `nnf-dm-system` namespace is added automatically; there is no need to specify it manually here.
The NNF Data Movement Manager is responsible for ensuring that `nnf-dm-system` is in
`spec.namespaces`. This ensures that the NNF DM Worker pods have global lustre mounted as long
as `nnf-dm` is deployed. **To unmount global lustre from the NNF DM Worker pods, the
`lustrefilesystem` resource must be deleted.**

The `lustrefilesystem` resource itself should be created in the `default` namespace (i.e.
`metadata.namespace`).
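As an illustrative aside (not part of the committed guide), a namespace entry could be added to an existing resource with a merge patch along the following lines. The namespace name `mynamespace` is hypothetical; `mylustre` comes from the example above.

```shell
# Hypothetical example: grant pods in "mynamespace" access to this global lustre file system
kubectl patch lustrefilesystems mylustre -n default --type merge \
  -p '{"spec":{"namespaces":{"mynamespace":{"modes":["ReadWriteMany"]}}}}'
```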
## NNF Data Movement Manager

The NNF Data Movement Manager is responsible for monitoring `lustrefilesystem` resources to mount
(or unmount) the global lustre filesystem in each of the NNF DM Worker pods. These pods run on each
of the NNF nodes. This means that with each addition or removal of a `lustrefilesystems` resource,
the DM Worker pods restart to adjust their mount points.

The NNF Data Movement Manager also places a finalizer on the `lustrefilesystem` resource to indicate
that the resource is in use by Data Movement. This prevents the PV/PVC from being deleted while they
are being used by pods.

## Adding Global Lustre

As mentioned previously, the NNF Data Movement Manager monitors these resources and automatically
adds the `nnf-dm-system` namespace to all `lustrefilesystem` resources. Once this happens, a PV/PVC
is created for the `nnf-dm-system` namespace to access global lustre. The Manager updates the NNF DM
Worker pods, which are then restarted to mount the global lustre file system.
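As an illustrative aside (not part of the committed guide), one way to confirm the result is to check for the PV/PVC in `nnf-dm-system` and watch the DM Worker pods restart; this is only a sketch of how an administrator might verify the mount.

```shell
# Confirm that a PVC for the global lustre file system now exists in nnf-dm-system
kubectl get pvc -n nnf-dm-system

# Watch the NNF DM Worker pods restart to pick up the new mount point
kubectl get pods -n nnf-dm-system -w
```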
## Removing Global Lustre

When a `lustrefilesystem` is deleted, the NNF DM Manager takes notice and starts to unmount the file
system from the DM Worker pods, causing another restart of the DM Worker pods. Once this is
finished, the DM finalizer is removed from the `lustrefilesystem` resource to signal that it is no
longer in use by Data Movement.

If a `lustrefilesystem` does not delete, check the finalizers to see what might still be using it.
It is possible to get into a situation where `nnf-dm` has been undeployed, so there is nothing to
remove the DM finalizer from the `lustrefilesystem` resource. If that is the case, then manually
remove the DM finalizer so the deletion of the `lustrefilesystem` resource can continue.
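A minimal sketch of that manual step, not part of the committed guide: inspect the finalizers with `jsonpath`, then remove the stale Data Movement entry with `kubectl edit` (or an equivalent patch). The resource name `mylustre` comes from the earlier example.

```shell
# Inspect the finalizers on the stuck resource
kubectl get lustrefilesystems mylustre -n default -o jsonpath='{.metadata.finalizers}{"\n"}'

# Edit the resource and delete the Data Movement finalizer entry from metadata.finalizers
kubectl edit lustrefilesystems mylustre -n default
```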

docs/guides/index.md

Lines changed: 3 additions & 0 deletions
@@ -14,9 +14,12 @@
 * [Data Movement Configuration](data-movement/readme.md)
 * [Copy Offload API](data-movement/copy-offload-api.html)
 * [Lustre External MGT](external-mgs/readme.md)
+* [Global Lustre](global-lustre/readme.md)
 
 ## NNF User Containers
 
 * [User Containers](user-containers/readme.md)
 
+## Node Management
 
+* [Draining A Node](node-management/drain.md)

docs/guides/node-management/drain.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Draining A Node

The NNF software consists of a collection of DaemonSets and Deployments. The pods
on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain`
command is not able to remove the NNF software from a node. See [Safely Drain a Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for details about
the limitations posed by DaemonSet pods.

Given the limitations of DaemonSets, the NNF software will be drained by using taints,
as described in
[Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Drain NNF Pods From A Rabbit Node

Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.
The CSI driver pods will remain on the node to satisfy any unmount requests from k8s
as it cleans up the NNF pods.

```shell
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
```
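As an illustrative aside (not part of the committed page), after applying the taint the pods still running on the node can be listed with a field selector, to confirm that only the expected pods (such as the CSI driver) remain:

```shell
# List the pods still running on the tainted node
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
```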
To restore the node to service, remove the `cray.nnf.node.drain` taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain-
```

## The CSI Driver

While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.

**Warning** K8s relies on the CSI driver to unmount any filesystems that may have
been mounted into a pod's namespace. If it is not present when k8s is attempting
to remove a pod, then the pod may be left in the "Terminating" state. This is most
obvious when draining the `nnf-dm-worker` pods, which usually have filesystems
mounted in them.
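As an illustrative aside (not part of the committed page), pods stuck in that state can be spotted by filtering the pod list for the node:

```shell
# Look for pods stuck in Terminating on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE | grep Terminating
```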
Drain the CSI driver pod from a node by applying the `cray.nnf.node.drain.csi`
taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain.csi=true:NoSchedule cray.nnf.node.drain.csi=true:NoExecute
```

To restore the CSI driver pods to that node, remove the `cray.nnf.node.drain.csi` taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain.csi-
```

This taint will also drain the remaining NNF software if it has not already been
drained by the `cray.nnf.node.drain` taint.

docs/guides/rbac-for-users/readme.md

Lines changed: 4 additions & 4 deletions
@@ -125,15 +125,15 @@ Generate a key and certificate for our "flux" user, similar to the way we create
 
 After the keys have been generated, a new kubeconfig file can be created for the "flux" user, similar to the one for the "hpe" user above. Again, substitute "flux" in place of "hpe".
 
-### Apply the provided ClusterRole and create a ClusterRoleBinding
+### Use the provided ClusterRole and create a ClusterRoleBinding
 
-DataWorkflowServices has already defined the role to be used with WLMs. Simply apply the `workload-manager` ClusterRole from DataWorkflowServices to the system:
+DataWorkflowServices has already defined the role to be used with WLMs, named `dws-workload-manager`:
 
 ```console
-kubectl apply -f https://github.com/DataWorkflowServices/dws/raw/master/config/rbac/workload_manager_role.yaml
+kubectl get clusterrole dws-workload-manager
 ```
 
-Create and apply a ClusterRoleBinding to associate the "flux" user with the `workload-manager` ClusterRole:
+Create and apply a ClusterRoleBinding to associate the "flux" user with the `dws-workload-manager` ClusterRole:
 
 ClusterRoleBinding
 ```yaml

docs/guides/storage-profiles/readme.md

Lines changed: 46 additions & 0 deletions
@@ -45,6 +45,52 @@ To clear the default flag on a profile
 $ kubectl patch nnfstorageprofile durable -n nnf-system --type merge -p '{"data":{"default":false}}'
 ```
 
+# Creating The Initial Default Profile
+
+Create the initial default profile from scratch or by using the [NnfStorageProfile/template](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/examples/nnf_v1alpha1_nnfstorageprofile.yaml) resource as a template. If `nnf-deploy` was used to install nnf-sos, then the default profile described below will have been created automatically.
+
+To use the `template` resource, begin by obtaining a copy of it either from the nnf-sos repo or from a live system. To get it from a live system, use the following command:
+
+```shell
+kubectl get nnfstorageprofile -n nnf-system template -o yaml > profile.yaml
+```
+
+Edit the `profile.yaml` file to trim the metadata section so that it contains only a name and namespace. The namespace must be left as nnf-system, but the name should be set to signify that this is the new default profile. In this example we will name it `default`. The metadata section will look like the following, and will contain no other fields:
+
+```yaml
+metadata:
+  name: default
+  namespace: nnf-system
+```
+
+Mark this new profile as the default profile by setting `default: true` in the data section of the resource:
+
+```yaml
+data:
+  default: true
+```
+
+Apply this resource to the system and verify that it is the only one marked as the default resource:
+
+```shell
+kubectl get nnfstorageprofile -A
+```
+
+The output will appear similar to the following:
+
+```shell
+NAMESPACE    NAME       DEFAULT   AGE
+nnf-system   default    true      9s
+nnf-system   template   false     11s
+```
+
+The administrator should edit the `default` profile to record any cluster-specific settings.
+Maintain a copy of this resource YAML in a safe place so it isn't lost across upgrades.
+
+## Keeping The Default Profile Updated
+
+An upgrade of nnf-sos may include updates to the `template` profile. It may be necessary to manually copy these updates into the `default` profile.
+
 # Profile Parameters
 
 ## XFS
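An illustrative aside on the "Creating The Initial Default Profile" steps above, not part of the committed guide: applying the edited `profile.yaml`, and, after a later nnf-sos upgrade, comparing the `default` profile against the updated `template`. This is only a sketch of one possible workflow.

```shell
# Apply the edited profile saved earlier as profile.yaml
kubectl apply -f profile.yaml

# After an upgrade, compare the default profile against the updated template
diff <(kubectl get nnfstorageprofile -n nnf-system template -o yaml) \
     <(kubectl get nnfstorageprofile -n nnf-system default -o yaml)
```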

docs/repo-guides/release-nnf-sw/readme.md

Lines changed: 3 additions & 2 deletions
@@ -35,7 +35,7 @@ need to be released separately.
 ## Primer
 
 This document is based on the process set forth by the [DataWorkflowServices Release
-Process](https://dataworkflowservices.github.io/v0.0.2/repo-guides/create-a-release/readme/).
+Process](https://dataworkflowservices.github.io/latest/repo-guides/create-a-release/readme/).
 Please read that as a background for this document before going any further.
 
 ## Requirements
@@ -93,7 +93,7 @@ just an example.
 
 |Repo |Update|
 |---------------------|------|
-|`nnf-mfu` |The new version of `nnf-mfu` is referenced by the `NNFMFU` variable in several places:<br><br>`nnf-sos`<br>1. `Makefile` replace `NNFMFU` with `nnf-mfu's` tag.<br><br>`nnf-dm`<br>1. In `Dockerfile` and `Makefile`, replace `NNFMU_VERSION` with the new version.<br>2. In `config/manager/kustomization.yaml`, replace `nnf-mfu`'s `newTag: <X.Y.Z>.`|
+|`nnf-mfu` |The new version of `nnf-mfu` is referenced by the `NNFMFU` variable in several places:<br><br>`nnf-sos`<br>1. `Makefile` replace `NNFMFU` with `nnf-mfu's` tag.<br><br>`nnf-dm`<br>1. In `Dockerfile` and `Makefile`, replace `NNFMU_VERSION` with the new version.<br>2. In `config/manager/kustomization.yaml`, replace `nnf-mfu`'s `newTag: <X.Y.Z>.`<br><br>`nnf-deploy`<br>1. In `config/repositories.yaml` replace `NNFMFU_VERSION` with the new version.|
 |`lustre-fs-operator` |update `config/manager/kustomization.yaml` with the correct version.|
 |`dws` |update `config/manager/kustomization.yaml` with the correct version.|
 |`nnf-sos` |update `config/manager/kustomization.yaml` with the correct version.|
@@ -183,6 +183,7 @@ that everything is current on `master` for `nnf-deploy`.
 
 12. Follow steps 6-7 from the previous section to finalize the release of `nnf-deploy`.
 
+**Please review documentation for changes you may have made.**
 **The software is now released!**
 
 ## Clone a release

external/nnf-dm

Submodule nnf-dm updated 187 files

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ nav:
     - 'Storage Profiles': 'guides/storage-profiles/readme.md'
     - 'User Containers': 'guides/user-containers/readme.md'
     - 'Lustre External MGT': 'guides/external-mgs/readme.md'
+    - 'Global Lustre': 'guides/global-lustre/readme.md'
+    - 'Draining A Node': 'guides/node-management/drain.md'
   - 'RFCs':
     - rfcs/index.md
     - 'Rabbit Request For Comment Process': 'rfcs/0001/readme.md'
