Commit 58f70e7

Update config files and fix errors found in testing new configs (#214)
* Add --RESEnvironmentName to the installer to ease initial integration with Research and Engineering Studio (RES). Automatically add the correct submitter security groups, configure the /home directory, and choose the subnets based on the RES subnets if they aren't specified. Resolves #207
* Update the template config files. Add more comments to clarify that these are examples that should be copied and customized by users, add comments for typical configuration options, and delete obsolete configs left over from v1. Resolves #203
* Set the default head node instance type based on architecture. Resolves #206
* Clean up ansible-lint errors and warnings. The Arm architecture cluster was failing because of an incorrect condition in the ansible playbook that is flagged by lint.
* Use the VDI controller instead of the cluster manager for users and groups info. The cluster manager stopped being domain joined for some reason.
* Paginate describe_instances when creating the head node A record. Otherwise, the cluster head node instance may not be found.
* Add a default MungeKeySecret. This should be the default, or you can't access multiple clusters from the same server.
* Increase the timeout for the SSM command that configures submitters. The extra time is needed to compile Slurm.
* Force Slurm to be rebuilt for submitters of all OS distributions, even if they match the OS of the cluster. Otherwise, errors occur because PluginDir can't be found in the same location as when it was compiled.
* Paginate describe_instances in the UpdateHeadNode lambda.
* Add a check for a minimum of 4 GB of memory for the Slurm controller.
* Update documentation.
* Remove Regions from InstanceConfig. This was left over from the legacy cluster; ParallelCluster doesn't support multiple regions.
1 parent a8b6555 commit 58f70e7
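
Both describe_instances pagination items above follow the standard boto3 paginator pattern. A minimal sketch of the idea (the function name and filters are illustrative assumptions, not the repository's actual lambda code):

```
import boto3

def find_head_node(cluster_name: str, region: str):
    """Look up the cluster head node, paginating describe_instances so the
    instance isn't missed when the results span multiple pages."""
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            # Illustrative filters; the real lambdas may select instances differently.
            {"Name": "tag:parallelcluster:cluster-name", "Values": [cluster_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                return instance
    return None
```

Without the paginator, a single describe_instances call only returns the first page of results, which is how the head node could be missed.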

63 files changed: +1639, -1534 lines changed

docs/debug.md

Lines changed: 6 additions & 82 deletions
@@ -1,53 +1,12 @@
 # Debug
 
-## Log Files on File System
+For ParallelCluster and Slurm issues, refer to the official [AWS ParallelCluster Troubleshooting documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html).
 
-Most of the key log files are stored on the Slurm file system so that they can be accessed from any instance with the file system mounted.
-
-| Logfile | Description
-|---------|------------
-| `/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/slurmd.log` | Slurm daemon (slurmd) logfile
-| `/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/spot_monitor.log` | Spot monitor logfile
-| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/cloudwatch.log` | Cloudwatch cron (slurm_ec2_publish_cw.py) logfile
-| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/power_save.log` | Power saving API logfile
-| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/slurmctld.log` | Slurm controller daemon (slurmctld) logfile
-| `/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/terminate_old_instances.log` | Terminate old instances cron (terminate_old_instances.py) logfile
-| `/opt/slurm/{{ClusterName}}/logs/slurmdbd/slurmdbd.log` | Slurm database daemon (slurmdbd) logfile
-
-## Slurm AMI Nodes
-
-The Slurm AMI nodes build the Slurm binaries for all of the configured operating system (OS) variants.
-The Amazon Linux 2 build is a prerequisite for the Slurm controllers and slurmdbd instances.
-The other builds are prerequisites for compute nodes and submitters.
-
-First check for errors in the user data script. The following command will show the output:
-
-`grep cloud-init /var/log/messages | less`
-
-The most common problem is that the ansible playbook failed.
-Check the ansible log file to see what failed.
-
-`less /var/log/ansible.log`
-
-The following command will rerun the user data.
-It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.
-
-`/var/lib/cloud/instance/scripts/part-001`
-
-If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
-your modified playbook by running the following command.
-
-`/root/slurm_node_ami_config.sh`
-
-## Slurm Controller
+## Slurm Head Node
 
 If slurm commands hang, then it's likely a problem with the Slurm controller.
 
-The first thing to check is the controller's logfile which is stored on the Slurm file system.
-
-`/opt/slurm/{{ClusterName}}/logs/nodes/slurmctl[1-2]/slurmctld.log`
-
-If the logfile doesn't exist or is empty then you will need to connect to the slurmctl instance using SSM Manager or ssh and switch to the root user.
+Connect to the head node from the EC2 console using SSM Manager or ssh and switch to the root user.
 
 `sudo su`
 
@@ -59,24 +18,14 @@ If it isn't then first check for errors in the user data script. The following c
 
 `grep cloud-init /var/log/messages | less`
 
-The most common problem is that the ansible playbook failed.
-Check the ansible log file to see what failed.
+Then check the controller's logfile.
 
-`less /var/log/ansible.log`
+`/var/log/slurmctld.log`
 
 The following command will rerun the user data.
-It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.
 
 `/var/lib/cloud/instance/scripts/part-001`
 
-If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
-your modified playbook by running the following command.
-
-`/root/slurmctl_config.sh`
-
-The daemon may also be failing because of some other error.
-Check the `slurmctld.log` for errors.
-
 Another way to debug the `slurmctld` daemon is to launch it interactively with debug set high.
 The first thing to do is get the path to the slurmctld binary.
 
@@ -90,31 +39,6 @@ Then you can run slurmctld:
 $slurmctld -D -vvvvv
 ```
 
-### Slurm Controller Log Files
-
-| Logfile | Description
-|---------|------------
-| `/var/log/ansible.log` | Ansible logfile
-| `/var/log/slurm/cloudwatch.log` | Logfile for the script that uploads CloudWatch events.
-| `/var/log/slurm/slurmctld.log` | slurmctld logfile
-| `/var/log/slurm/power_save.log` | Slurm plugin logfile with power saving scripts that start, stop, and terminated instances.
-| `/var/log/slurm/terminate_old_instances.log` | Logfile for the script that terminates stopped instances.
-
-## Slurm Accounting Database (slurmdbd)
-
-If you are having problems with the slurm accounting database connect to the slurmdbd instance using SSM Manager.
-
-Check for cloud-init and ansible errors the same way as for the slurmctl instance.
-
-Also check the `slurmdbd.log` for errors.
-
-### Log Files
-
-| Logfile | Description
-|---------|------------
-| `/var/log/ansible.log` | Ansible logfile
-| `/var/log/slurm/slurmdbd.log` | slurmctld logfile
-
 ## Compute Nodes
 
 If there are problems with the compute nodes, connect to them using SSM Manager.
@@ -132,7 +56,7 @@ Check that the slurm daemon is running.
 
 | Logfile | Description
 |---------|------------
-| `/var/log/slurm/slurmd.log` | slurmctld logfile
+| `/var/log/slurmd.log` | slurmd logfile
 
 ## Job Stuck in Pending State
 
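
The debugging steps above use SSM to connect to instances, and the commit also increases the timeout of the SSM command that configures submitters so that Slurm has time to compile. For reference, a minimal boto3 sketch of sending a shell command with a longer execution timeout (the instance ID, script path, and timeout value are placeholders, not the project's actual values):

```
import boto3

ssm = boto3.client("ssm")

# Run a long-running configuration script on a submitter instance.
# executionTimeout is in seconds; a generous value leaves time to compile Slurm.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],                  # placeholder instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={
        "commands": ["/path/to/configure_submitter.sh"],  # placeholder script
        "executionTimeout": ["10800"],                    # 3 hours
    },
)
print(response["Command"]["CommandId"])
```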

docs/delete-cluster.md

Lines changed: 10 additions & 41 deletions
@@ -1,45 +1,14 @@
-# Delete Cluster (legacy)
+# Delete Cluster
 
-Most of the resources can be deleted by simply deleting the cluster's CloudFormation stack.
-However, there a couple of resources that must be manually deleted:
+To delete the cluster all you need to do is delete the configuration CloudFormation stack.
+This will delete the ParallelCluster cluster and all of the configuration resources.
 
-* The Slurm RDS database
-* The Slurm file system
+If you specified RESEnvironmentName then it will also deconfigure the creation of `users_groups.json` and also deconfigure the VDI
+instances so they are no longer using the cluster.
 
-The deletion of the CloudFormation stack will fail because of these 2 resources and some resources that are used
-by them will also fail to delete.
-Manually delete the resources and then retry deleting the CloudFormation stack.
+If you deployed the Slurm database stack then you can keep that and use it for other clusters.
+If you don't need it anymore, then you can delete the stack.
+You will also need to manually delete the RDS database.
 
-## Manually Delete RDS Database
-
-If the database contains production data then it is highly recommended that you back up the data.
-You could also keep the database and use it for creating new clusters.
-
-
-Even after deleting the database CloudFormation may say that it failed to delete.
-Confirm in the RDS console that it deleted and then ignore the resource when retrying the stack deletion.
-
-* Go the the RDS console
-* Select Databases on the left
-* Remove deletion protection
-* Select the cluster's database
-* Click `Modify`
-* Expand `Additional scaling configuration`
-* Uncheck `Scale the capacity to 0 ACIs when cluster is idle`
-* Uncheck `Enable deletion protection`
-* Click `Continue`
-* Select `Apply immediately`
-* Click `Modify cluster`
-* Delete the database
-* Select the cluster's database
-* Click `Actions` -> `Delete`
-* Click `Delete DB cluster`
-
-## Manually delete the Slurm file system
-
-### FSx for OpenZfs
-
-* Go to the FSx console
-* Select the cluster's file system
-* Click `Actions` -> `Delete file system`
-* Click `Delete file system`
+If you deployed the ParallelCluster UI then you can keep it and use it with other clusters.
+If you don't need it anymore then you can delete the stack.
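
As the updated page says, deleting the configuration CloudFormation stack is the only required step. A minimal boto3 sketch of the same operation (the stack name is a placeholder):

```
import boto3

cfn = boto3.client("cloudformation")
stack_name = "my-eda-slurm-cluster-config"  # placeholder stack name

# Deleting the configuration stack tears down the ParallelCluster cluster
# and the configuration resources that the stack created.
cfn.delete_stack(StackName=stack_name)

# Optionally wait for the deletion to finish.
waiter = cfn.get_waiter("stack_delete_complete")
waiter.wait(StackName=stack_name)
print(f"{stack_name} deleted")
```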

docs/deployment-prerequisites.md

Lines changed: 15 additions & 29 deletions
@@ -96,18 +96,18 @@ You should save your selections in the config file.
 
 | Parameter | Description | Valid Values | Default
 |------------------------------------|-------------|--------------|--------
-| [StackName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L221)] | The cloudformation stack that will deploy the cluster. | | None
-| [slurm/ClusterName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L318-L320) | Name of the Slurm cluster | For ParallelCluster shouldn't be the same as StackName | | None
-| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L222-L223) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
-| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L226-L227) | The vpc where the cluster will be deployed. | vpc-* | None
-| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L224-L225) | EC2 Keypair to use for instances | | None
-| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L435-L439) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
-| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L233-L234) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
-| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L444-L509) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)
+| [StackName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L366-L367) | The cloudformation stack that will deploy the cluster. | | None
+| [slurm/ClusterName](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L447-L452) | Name of the Slurm cluster | For ParallelCluster shouldn't be the same as StackName | | None
+| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
+| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
+| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
+| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
+| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
+| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)
 
 ### Configure the Compute Instances
 
-The [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L444-L509) configuration parameter configures the base operating systems, CPU architectures, instance families,
+The [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) configuration parameter configures the base operating systems, CPU architectures, instance families,
 and instance types that the Slurm cluster should support.
 ParallelCluster currently doesn't support heterogeneous clusters;
 all nodes must have the same architecture and Base OS.
@@ -118,6 +118,7 @@ all nodes must have the same architecture and Base OS.
 | CentOS 7 | x86_64
 | RedHat 7 | x86_64
 | RedHat 8 | x86_64, arm64
+| Rocky 8 | x86_64, arm64
 
 You can exclude instances types by family or specific instance type.
 By default the InstanceConfig excludes older generation instance families.
@@ -134,19 +135,16 @@ The disadvantage is higher cost if the instance is lightly loaded.
 The default InstanceConfig includes all supported base OSes and architectures and burstable and general purpose
 instance types.
 
-* [default instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L124-L166)
-* [default instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L168-L173)
-* [default excluded instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L175-L192)
-* [default excluded instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L194-L197)
+* [default instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L271)
+* [default instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L314-L319)
+* [default excluded instance families](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L321-L338)
+* [default excluded instance types](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L340-L343)
 
 Note that instance types and families are python regular expressions.
 
 ```
 slurm:
   InstanceConfig:
-    BaseOsArchitecture:
-      CentOS:
-        7: [x86_64]
     Include:
       InstanceFamilies:
         - t3.*
@@ -160,9 +158,6 @@ The following InstanceConfig configures instance types recommended for EDA workl
 ```
 slurm:
   InstanceConfig:
-    BaseOsArchitecture:
-      CentOS:
-        7: [x86_64]
     Include:
       InstanceFamilies:
         - c5.*
@@ -186,15 +181,6 @@ slurm:
       DefaultMinCount: 1
 ```
 
-The Legacy cluster also allows you to specify the names of specific nodes.
-
-```
-slurm:
-  InstanceConfig:
-    AlwaysOnNodes:
-      - nodename-[0-4]
-```
-
 ### Configure Fair Share Scheduling (Optional)
 
 Slurm supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
@@ -285,7 +271,7 @@ then jobs will stay pending in the queue until a job completes and frees up a li
 Combined with the fairshare algorithm, this can prevent users from monopolizing licenses and preventing others from
 being able to run their jobs.
 
-Licenses are configured using the [slurm/Licenses](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L621-L629) configuration variable.
+Licenses are configured using the [slurm/Licenses](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L569-L577) configuration variable.
 If you are using the Slurm database then these will be configured in the database.
 Otherwises they will be configured in **/opt/slurm/{{ClusterName}}/etc/slurm_licenses.conf**.
 
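
The Include and Exclude entries in the InstanceConfig examples above are Python regular expressions. A minimal sketch of how such patterns can be sanity-checked locally (the helper and the exclude-wins precedence are assumptions for illustration, not the project's actual matching code):

```
import re

# Patterns copied from the config examples above.
include_families = ["t3.*", "c5.*"]
exclude_types = [r".+\.(micro|nano)", r".*\.metal"]

def is_allowed(instance_type: str) -> bool:
    """Whole-name regex matching, with excludes taking precedence (an assumption)."""
    if any(re.fullmatch(p, instance_type) for p in exclude_types):
        return False
    family = instance_type.split(".")[0]
    return any(re.fullmatch(p, family) for p in include_families)

print(is_allowed("t3.micro"))    # False: excluded by '.+\.(micro|nano)'
print(is_allowed("c5.2xlarge"))  # True: family matches 'c5.*'
```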

docs/onprem.md

Lines changed: 5 additions & 17 deletions
@@ -1,6 +1,6 @@
-# On-Premises Integration (legacy)
+# On-Premises Integration
 
-The slurm cluster can also be configured to manage on-premises compute nodes.
+The Slurm cluster can also be configured to manage on-premises compute nodes.
 The user must configure the on-premises compute nodes and then give the configuration information.
 
 ## Network Requirements
@@ -20,6 +20,9 @@ All of the compute nodes in the cluster, including the on-prem nodes, must have
 This can involve mounting filesystems across VPN or Direct Connect or synchronizing file systems using tools like rsync or NetApp FlexCache or SnapMirror.
 Performance will dictate the architecture of the file system.
 
+The onprem compute nodes must mount the Slurm controller's NFS export so that they have access to the Slurm binaries and configuration file.
+They must then be configured to run slurmd so that they can be managed by Slurm.
+
 ## Slurm Configuration of On-Premises Compute Nodes
 
 The slurm cluster's configuration file allows the configuration of on-premises compute nodes.
@@ -29,21 +32,6 @@ All that needs to be configured are the configuration file for the on-prem nodes
 
 ```
 InstanceConfig:
-  UseSpot: true
-  DefaultPartition: CentOS_7_x86_64_spot
-  NodesPerInstanceType: 10
-  BaseOsArchitecture:
-    CentOS: {7: [x86_64]}
-  Include:
-    MaxSizeOnly: false
-    InstanceFamilies:
-      - t3
-    InstanceTypes: []
-  Exclude:
-    InstanceFamilies: []
-    InstanceTypes:
-      - '.+\.(micro|nano)' # Not enough memory
-      - '.*\.metal'
   OnPremComputeNodes:
     ConfigFile: 'slurm_nodes_on_prem.conf'
     CIDR: '10.1.0.0/16'
