
Commit 0030f36

Merge release v0.1.5
Release v0.1.5
2 parents 27732c5 + e9c843a commit 0030f36

9 files changed: +343 -6 lines changed


docs/guides/data-movement/readme.md

Lines changed: 8 additions & 0 deletions
@@ -90,6 +90,14 @@ The `CreateRequest` API call that is used to create Data Movement with the Copy
options that allow a user to specify settings for that particular Data Movement. These settings
are on a per-request basis.

The Copy Offload API requires the `nnf-dm` daemon to be running on the compute node. This daemon may be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run it only when a user requests it. See [Compute Daemons](../compute-daemons/readme.md) for the systemd service configuration of the daemon. See `RequiredDaemons` in [Directive Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the daemon when the WLM runs it only on demand.

If the WLM runs the `nnf-dm` daemon only on demand, the user can request that the daemon be run for their job by specifying `requires=copy-offload` in their `DW` directive. The following is an example:

```bash
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload
```
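To confirm whether the daemon is currently active on a compute node, the systemd service can be checked directly. This is a minimal sketch; it assumes the daemon is installed as a systemd service named `nnf-dm`, per the service configuration described in the Compute Daemons guide.

```console
# On the compute node: check whether the nnf-dm copy-offload daemon is running.
systemctl status nnf-dm
```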
See the [DataMovementCreateRequest API](copy-offload-api.html#datamovement.DataMovementCreateRequest)
definition for what can be configured.

docs/guides/directive-breakdown/readme.md

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
---
authors: Matt Richerson <[email protected]>
categories: provisioning
---

# Directive Breakdown

## Background

The `#DW` directives in a job script are not intended to be interpreted by the workload manager. The workload manager passes the `#DW` directives to the NNF software through the DWS `workflow` resource, and the NNF software determines what resources are needed to satisfy the directives. The NNF software communicates this information back to the workload manager through the DWS `DirectiveBreakdown` resource. This document describes how the WLM should interpret the information in the `DirectiveBreakdown`.

## DirectiveBreakdown Overview

The DWS `DirectiveBreakdown` contains all the information necessary to inform the WLM how to pick storage and compute nodes for a job. The `DirectiveBreakdown` resource is created by the NNF software during the `Proposal` phase of the DWS workflow. The `spec` section of the `DirectiveBreakdown` is filled in with the `#DW` directive by the NNF software, and the `status` section contains the information for the WLM. The WLM should wait until the `status.ready` field is true before interpreting the rest of the `status` fields.

The contents of the `DirectiveBreakdown` will look different depending on the file system type and options specified by the user. The `status` section contains enough information that the WLM may be able to figure out the underlying file system type requested by the user, but the WLM should not make any decisions based on the file system type. Instead, the WLM should make storage and compute allocation decisions based on the generic information provided in the `DirectiveBreakdown`, since the storage and compute allocations needed to satisfy a `#DW` directive may differ based on options other than the file system type.
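Once a workflow reaches the `Proposal` phase, its `DirectiveBreakdown` resources can be inspected directly to watch for `status.ready`. A minimal sketch, assuming the resource lives in the `default` namespace; the resource name shown here is illustrative:

```console
# List DirectiveBreakdown resources, then dump one and check that status.ready is true.
kubectl get directivebreakdowns -A
kubectl get directivebreakdown -n default example-0 -o yaml
```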
## Storage Nodes

The `status.storage` section of the `DirectiveBreakdown` describes how the storage allocations should be made and any constraints on the NNF nodes that can be picked. The `status.storage` section will exist only for `jobdw` and `create_persistent` directives. An example of the `status.storage` section is included below.

```yaml
...
spec:
  directive: '#DW jobdw capacity=1GiB type=xfs name=example'
  userID: 7900
status:
  ...
  ready: true
  storage:
    allocationSets:
    - allocationStrategy: AllocatePerCompute
      constraints:
        labels:
        - dataworkflowservices.github.io/storage=Rabbit
      label: xfs
      minimumCapacity: 1073741824
    lifetime: job
    reference:
      kind: Servers
      name: example-0
      namespace: default
...
```
* `status.storage.allocationSets` is a list of storage allocation sets that are needed for the job. An allocation set is a group of individual storage allocations that all have the same parameters and requirements. Depending on the storage type specified by the user, there may be more than one allocation set. Allocation sets should be handled independently.

* `status.storage.allocationSets.allocationStrategy` specifies how the allocations should be made.
    * `AllocatePerCompute` - One allocation is needed per compute node in the job. The size of an individual allocation is specified in `status.storage.allocationSets.minimumCapacity`.
    * `AllocateAcrossServers` - One or more allocations are needed with an aggregate capacity of `status.storage.allocationSets.minimumCapacity`. This allocation strategy does not imply anything about how many allocations to make per NNF node or how many NNF nodes to use. The allocations on each NNF node should be the same size.
    * `AllocateSingleServer` - One allocation is needed with a capacity of `status.storage.allocationSets.minimumCapacity`.

* `status.storage.allocationSets.constraints` is a set of requirements for which NNF nodes can be picked. More information about the different constraint types is provided in the [Storage Constraints](readme.md#storage-constraints) section below.

* `status.storage.allocationSets.label` is an opaque string that the WLM uses when creating the `spec.allocationSets` entry in the DWS `Servers` resource.

* `status.storage.allocationSets.minimumCapacity` is the allocation capacity in bytes. The interpretation of this field depends on the value of `status.storage.allocationSets.allocationStrategy`.

* `status.storage.lifetime` is used to specify how long the storage allocations will last.
    * `job` - The allocation will last for the lifetime of the job.
    * `persistent` - The allocation will last longer than the lifetime of the job.

* `status.storage.reference` is an object reference to a DWS `Servers` resource where the WLM can specify allocations. An example of viewing this resource is shown after this list.
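The `Servers` resource named by `status.storage.reference` can be viewed with `kubectl`, using the name and namespace from the example above. This is a sketch; the WLM is expected to write its chosen allocations into the `spec` of this resource.

```console
# Dump the Servers resource that the DirectiveBreakdown's status.storage.reference points to.
kubectl get servers -n default example-0 -o yaml
```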
### Storage Constraints

Constraints on an allocation set provide additional requirements for how the storage allocations should be made on NNF nodes.

* `labels` specifies a list of labels that must all be on a DWS `Storage` resource in order for an allocation to exist on that `Storage`. An example of listing these labels with `kubectl` follows this list.

    ```yaml
    constraints:
      labels:
      - dataworkflowservices.github.io/storage=Rabbit
      - mysite.org/pool=firmware_test
    ```

    ```yaml
    apiVersion: dataworkflowservices.github.io/v1alpha2
    kind: Storage
    metadata:
      labels:
        dataworkflowservices.github.io/storage: Rabbit
        mysite.org/pool: firmware_test
        mysite.org/drive-speed: fast
      name: rabbit-node-1
      namespace: default
    ...
    ```
* `colocation` specifies how two or more allocations influence the location of each other. The colocation constraint has two fields, `type` and `key`. Currently, the only value for `type` is `exclusive`. `key` can be any value. This constraint means that the allocations from an allocation set with the colocation constraint can't be placed on an NNF node with another allocation whose allocation set has a colocation constraint with the same key. Allocations from allocation sets with colocation constraints with different keys, or from allocation sets without a colocation constraint, may be placed on the same NNF node.

    ```yaml
    constraints:
      colocation:
        type: exclusive
        key: lustre-mgt
    ```

* `count` specifies the number of allocations to make when `status.storage.allocationSets.allocationStrategy` is `AllocateAcrossServers`.

    ```yaml
    constraints:
      count: 5
    ```

* `scale` is a unitless value from 1-10 that is meant to guide the WLM on how many allocations to make when `status.storage.allocationSets.allocationStrategy` is `AllocateAcrossServers`. The actual number of allocations is not meant to correspond to the value of scale. Rather, 1 would indicate the minimum number of allocations to reach `status.storage.allocationSets.minimumCapacity`, and 10 would be the maximum number of allocations that make sense given the `status.storage.allocationSets.minimumCapacity` and the compute node count. The NNF software does not interpret this value, and it is up to the WLM to define its meaning.

    ```yaml
    constraints:
      scale: 8
    ```
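The labels carried by each DWS `Storage` resource, which the `labels` constraint above is matched against, can be listed to see which NNF nodes would satisfy a given constraint. A minimal sketch, assuming the `Storage` resources live in the `default` namespace as in the example above:

```console
# Show every Storage resource along with its labels.
kubectl get storages -n default --show-labels
```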
## Compute Nodes

The `status.compute` section of the `DirectiveBreakdown` describes how the WLM should pick compute nodes for a job. The `status.compute` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.compute` section is included below.

```yaml
...
spec:
  directive: '#DW jobdw capacity=1TiB type=lustre name=example'
  userID: 3450
status:
  ...
  compute:
    constraints:
      location:
      - access:
        - priority: mandatory
          type: network
        - priority: bestEffort
          type: physical
        reference:
          fieldPath: servers.spec.allocationSets[0]
          kind: Servers
          name: example-0
          namespace: default
      - access:
        - priority: mandatory
          type: network
        reference:
          fieldPath: servers.spec.allocationSets[1]
          kind: Servers
          name: example-0
          namespace: default
...
```
The `status.compute.constraints` section lists any constraints on which compute nodes can be used. Currently, the only constraint type is the `location` constraint. `status.compute.constraints.location` is a list of location constraints that must all be satisfied.

A location constraint consists of an `access` list and a `reference`.

* `status.compute.constraints.location.reference` is an object reference with a `fieldPath` that points to an allocation set in the `Servers` resource. If this is from a `#DW jobdw` directive, the `Servers` resource won't be filled in until the WLM picks storage nodes for the allocations. An example of following this reference is shown after this list.
* `status.compute.constraints.location.access` is a list that specifies what type of access the compute nodes need to have to the storage allocations in the allocation set. An allocation set may have multiple access types that are required.
    * `status.compute.constraints.location.access.type` specifies the connection type for the storage. This can be `network` or `physical`.
    * `status.compute.constraints.location.access.priority` specifies how necessary the connection type is. This can be `mandatory` or `bestEffort`.
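The `fieldPath` in a location constraint can be followed by hand to see which storage allocation set it refers to, once the WLM has filled in the `Servers` resource. A minimal sketch using the names from the example above:

```console
# Print the allocation set that servers.spec.allocationSets[0] refers to.
kubectl get servers -n default example-0 -o jsonpath='{.spec.allocationSets[0]}'
```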
## RequiredDaemons

The `status.requiredDaemons` section of the `DirectiveBreakdown` tells the WLM about any driver-specific daemons it must enable for the job. It is assumed that the WLM already knows about these driver-specific daemons and, if users are allowed to specify them, knows how to start them. The `status.requiredDaemons` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.requiredDaemons` section is included below.

```yaml
status:
  ...
  requiredDaemons:
  - copy-offload
  ...
```

The allowed list of required daemons that may be specified is defined in the [nnf-ruleset.yaml for DWS](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/dws/nnf-ruleset.yaml), found in the `nnf-sos` repository. The `ruleDefs.key[requires]` statement is specified in two places in the ruleset: one for `jobdw` and one for `persistentdw`. The ruleset allows a list of patterns to be specified, one pattern for each of the allowed daemons.
The `DW` directive will include a comma-separated list of daemons after the `requires` keyword. The following is an example:

```bash
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload
```
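If a site's ruleset defines more than one daemon, multiple daemons may be requested in the same directive by listing them after `requires`, separated by commas. A sketch; the `site-debug` daemon name here is hypothetical and would need to exist in the site's ruleset:

```bash
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload,site-debug
```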
The `DWDirectiveRule` resource currently active on the system can be viewed with:

```console
kubectl get -n dws-system dwdirectiverule nnf -o yaml
```

### Valid Daemons

Each site should define the list of daemons that are valid for that site and recognized by that site's WLM. The initial `nnf-ruleset.yaml` defines only one, called `copy-offload`. When a user specifies `copy-offload` in their `DW` directive, they are stating that their compute-node application will use the Copy Offload API Daemon described in the [Data Movement Configuration](../data-movement/readme.md).

docs/guides/index.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@
* [Copy Offload API](data-movement/copy-offload-api.html)
* [Lustre External MGT](external-mgs/readme.md)
* [Global Lustre](global-lustre/readme.md)
* [Directive Breakdown](directive-breakdown/readme.md)

## NNF User Containers

@@ -23,3 +24,4 @@
## Node Management

* [Draining A Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
docs/guides/node-management/nvme-namespaces.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# Debugging NVMe Namespaces

## Total Space Available or Used

Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the `nnf-node-manager` pod on that node.

To view the space on node ee50, find its `nnf-node-manager` pod and then exec into it to query the Redfish API:

```console
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
nnf-system   nnf-node-manager-jhglm   1/1   Running   0   61m   10.85.71.11   ee50   <none>   <none>
```
Then query the Redfish API to view the `AllocatedBytes` and `GuaranteedBytes`:

```console
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
{
  "@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
  "@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
  "Id": "0",
  "Name": "Capacity Source",
  "ProvidedCapacity": {
    "Data": {
      "AllocatedBytes": 128849888,
      "ConsumedBytes": 128849888,
      "GuaranteedBytes": 307132496928,
      "ProvisionedBytes": 307261342816
    },
    "Metadata": {},
    "Snapshot": {}
  },
  "ProvidedClassOfService": {},
  "ProvidingDrives": {},
  "ProvidingPools": {},
  "ProvidingVolumes": {},
  "Actions": {},
  "ProvidingMemory": {},
  "ProvidingMemoryChunks": {}
}
```
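To pull out just the byte counts, the same query can be filtered through `jq`. A minimal sketch using the pod name from the example above:

```console
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq '.ProvidedCapacity.Data'
{
  "AllocatedBytes": 128849888,
  "ConsumedBytes": 128849888,
  "GuaranteedBytes": 307132496928,
  "ProvisionedBytes": 307261342816
}
```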
## Total Orphaned or Leaked Space

To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations then there should be no `NnfNodeBlockStorages` in the k8s namespace with the Rabbit's name:

```console
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
No resources found in ee50 namespace.
```
To check that there are no orphaned namespaces, you can use the `nvme` command while logged into that Rabbit node:

```console
[root@ee50:~]# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S666NN0TB11877       SAMSUNG MZ1L21T9HCLS-00A07               1         8.57 GB / 1.92 TB          512 B + 0 B      GDC7302Q
```

There should be no namespaces on the Kioxia drives:

```console
[root@ee50:~]# nvme list | grep -i kioxia
[root@ee50:~]#
```
If there are namespaces listed, and there weren't any `NnfNodeBlockStorages` on the node, then they need to be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to delete the orphaned namespaces. This can take a few minutes after the pod is deleted:

```console
kubectl delete nnfnodeecdata ec-data -n ee50
kubectl delete pod -n nnf-system nnf-node-manager-jhglm
```
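Once a replacement `nnf-node-manager` pod is running, the same checks from above can be repeated to confirm the cleanup. The pod name will differ after the restart, and the `nvme list` check is run on the Rabbit node itself:

```console
# From a management host: confirm a new nnf-node-manager pod is running on ee50.
kubectl get pods -n nnf-system -o wide | grep ee50 | grep node-manager

# On the Rabbit node: verify that no Kioxia namespaces remain.
nvme list | grep -i kioxia
```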

docs/guides/rbac-for-users/readme.md

Lines changed: 24 additions & 2 deletions
@@ -133,9 +133,11 @@ DataWorkflowServices has already defined the role to be used with WLMs, named `d
kubectl get clusterrole dws-workload-manager
```

If the "flux" user requires only the normal WLM permissions, then create and apply a ClusterRoleBinding to associate the "flux" user with the `dws-workload-manager` ClusterRole.

The `dws-workload-manager` role is defined in [workload_manager_role.yaml](https://github.com/DataWorkflowServices/dws/blob/master/config/rbac/workload_manager_role.yaml).

ClusterRoleBinding for WLM permissions only:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
@@ -151,4 +153,24 @@ roleRef:
  apiGroup: rbac.authorization.k8s.io
```

If the "flux" user requires the normal WLM permissions as well as some of the NNF permissions, perhaps to collect some NNF resources for debugging, then create and apply a ClusterRoleBinding to associate the "flux" user with the `nnf-workload-manager` ClusterRole.

The `nnf-workload-manager` role is defined in [workload_manager_nnf_role.yaml](https://github.com/NearNodeFlash/nnf-sos/blob/master/config/rbac/workload_manager_nnf_role.yaml).

ClusterRoleBinding for WLM and NNF permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flux
subjects:
- kind: User
  name: flux
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: nnf-workload-manager
  apiGroup: rbac.authorization.k8s.io
```
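After applying the binding, a quick sanity check can confirm that the "flux" user has the expected access. A sketch; the `workflows` resource is one of the DWS resources the WLM role is expected to cover, and other resources can be checked the same way:

```console
# Confirm the binding exists, then check one of the permissions it should grant.
kubectl get clusterrolebinding flux
kubectl auth can-i list workflows.dataworkflowservices.github.io --as=flux
```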
The WLM should then use the kubeconfig file associated with this "flux" user to access the DataWorkflowServices API and the Rabbit system.
