Updates to allow submitters to use multiple ParallelCluster clusters … (#140)
* Updates to allow submitters to use multiple ParallelCluster clusters at the same time
Update the slurm config to add the ClusterName to the path.
Enable multiple clusters to be used by submitters
Change the mount point to include the cluster name so it is unique between clusters.
Updates scripts and config paths to include the cluster name in the paths.
Add a symbolic link so the head and compute nodes have access to the same
path as the submitter.
Resolves #139
* Add scripts and cron jobs to update users_groups.json
Add the commands to configure/deconfigure an instance that has access to
the users and groups so that the users_groups.json file can be created
and updated each hour.
* Fix fstab path, modulefiles config
Also rename the CfnOutput names so commands are in the order that they
are expected to run.
* Add LOCALDOMAIN to the modulefile
Fixes DNS resolution of cluster hostnames so srun doesn't fail.
* Update documentation
Put legacy docs at end.
README.md (13 additions & 11 deletions)
@@ -2,7 +2,7 @@
2
2
3
3
This repository contains an AWS Cloud Development Kit (CDK) application that creates a Slurm cluster that is suitable for running production EDA workloads on AWS.
4
4
5
-
The original version of this repo used a custom Python plugin to integrate Slurm with AWS.
5
+
The original (legacy) version of this repo used a custom Python plugin to integrate Slurm with AWS.
6
6
The latest version of the repo uses AWS ParallelCluster for the core Slurm infrastructure and AWS integration.
7
7
The big advantage of moving to AWS ParallelCluster is that it is a supported AWS service.
8
8
Currently, some of the features of the legacy version are not supported in the ParallelCluster version, but
@@ -16,29 +16,32 @@ Key features are supported by both versions are:
16
16
* Handling of spot terminations
17
17
* Handling of insufficient capacity exceptions
18
18
* Batch and interactive partitions (queues)
19
-
*Managed tool licenses as a consumable resource
19
+
* Manages tool licenses as a consumable resource
20
20
* User and group fair share scheduling
21
21
* Slurm accounting database
22
22
* CloudWatch dashboard
23
23
* Job preemption
24
24
* Manage on-premises compute nodes
25
-
* Configure partitions (queues) and nodes that are always on to support reserved instances RIs and savings plans.
25
+
* Configure partitions (queues) and nodes that are always on to support reserved instances (RIs) and savings plans (SPs).
26
26
27
27
Features in the legacy version and not in the ParallelCluster version:
28
28
29
-
* Multi-AZ support. Supported by ParallelCluster, but not implemented.
29
+
* Heterogeneous clusters with mixed OSes and CPU architectures on compute nodes.
30
+
* Multi-AZ support. Supported by ParallelCluster, but not currently implemented.
31
+
* Multi-region support
30
32
* AWS Fault Injection Simulator (FIS) templates to test spot terminations
31
-
* Heterogenous cluster with mixed OSes and CPU architectures on compute nodes.
32
33
* Support for MungeKeySsmParameter
33
34
* Multi-cluster federation
34
-
* Multi-region support
35
35
36
36
ParallelCluster Limitations
37
37
38
-
* Number of "Compute Resources" is limited to 50 which limits the number of instance types allowed in a cluster.
38
+
* Number of "Compute Resources" (CRs) is limited to 50 which limits the number of instance types allowed in a cluster.
39
+
ParallelCluster can have multiple instance types in a CR, but with memory based scheduling enabled, they must all have the same number of cores and amount of memory.
39
40
* All Slurm instances must have the same OS and CPU architecture.
* Multi-region support. This is unlikely to change because multi-region services run against our archiectural philosophy. Federation may be a better option
42
+
* Multi-region support. This is unlikely to change because multi-region services run against our architectural philosophy.
43
+
Federation may be an option but its current implementation limits scheduler performance and doesn't allow cluster prioritization so jobs land on
44
+
random clusters.
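
The Compute Resource (CR) limitation above can be illustrated with a minimal sketch. It assumes memory-based scheduling is enabled, so instance types may share a CR only when they have the same number of cores and the same amount of memory. The function name, the instance names, and the core and memory values are illustrative placeholders, not taken from this repository or from an AWS price list.

```python
from collections import defaultdict

# Limit on Compute Resources per cluster, as stated in the limitations above.
MAX_COMPUTE_RESOURCES = 50

# Hypothetical instance catalog: name -> (cores, memory in GiB).
# These values are placeholders for illustration only.
INSTANCE_TYPES = {
    "type-a.large": (2, 16),
    "type-b.large": (2, 16),
    "type-c.large": (2, 32),
    "type-d.xlarge": (4, 32),
}


def group_into_compute_resources(instance_types):
    """Group instance types that share the same (cores, memory) shape.

    Each group could be placed in one CR when memory-based scheduling
    is enabled. Raises ValueError if more groups are needed than allowed.
    """
    groups = defaultdict(list)
    for name, shape in sorted(instance_types.items()):
        groups[shape].append(name)
    if len(groups) > MAX_COMPUTE_RESOURCES:
        raise ValueError(
            f"{len(groups)} compute resources needed, "
            f"limit is {MAX_COMPUTE_RESOURCES}"
        )
    return dict(groups)


compute_resources = group_into_compute_resources(INSTANCE_TYPES)
for (cores, mem_gib), names in compute_resources.items():
    print(f"{cores} cores, {mem_gib} GiB: {', '.join(names)}")
```

Grouping by shape shows why the limit matters: every distinct combination of cores and memory consumes one of the 50 CRs, regardless of how many instance types share it.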
42
45
43
46
Slurm Limitations
44
47
@@ -96,14 +99,13 @@ Legacy:
96
99
* Rocky Linux 8 and arm64
97
100
* Rocky Linux 8 and x86_64
98
101
99
-
Note that in the ParallelCluster version all compute nodes must have the same OS and architecture.
102
+
Note that in the ParallelCluster version, all compute nodes must have the same OS and architecture.
100
103
101
104
## Documentation
102
105
103
106
[View on GitHub Pages](https://aws-samples.github.io/aws-eda-slurm-cluster/)
104
107
105
-
To view the docs locally, clone the repository and run mkdocs:
106
-
108
+
You can also view the docs locally.
107
109
The docs are in the docs directory. You can view them in an editor or using the mkdocs tool.
108
110
109
111
I recommend installing mkdocs in a python virtual environment.
@@ -38,24 +44,25 @@ remediation or create a support ticket.
38
44
39
45
## Deploy Using ParallelCluster
40
46
41
-
### Create ParallelCluster UI
47
+
### Create ParallelCluster UI (optional but recommended)
42
48
43
49
It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
44
50
A different UI is required for each version of ParallelCluster that you are using.
45
51
The versions are listed in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
46
52
The minimum required version is 3.6.0, which adds support for RHEL 8 and increases the number of allowed queues and compute resources.
47
-
The suggested version is at least 3.7.0 because it adds configurate compute node weights which we use to prioritize the selection of
53
+
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
48
54
compute nodes by their cost.
49
55
50
56
The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).
51
57
52
58
### Create ParallelCluster Slurm Database
53
59
54
60
The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
55
-
It you need these and other features then you will need to create ParallelCluster Slurm Database.
61
+
If you need these and other features then you will need to create a ParallelCluster Slurm Database.
62
+
You do not need to create a new database for each cluster; multiple clusters can share the same database.
56
63
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).
57
64
58
-
### Configuration File
65
+
### Create Configuration File
59
66
60
67
The first step in deploying your cluster is to create a configuration file.
61
68
A default configuration file is found in [source/resources/config/default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml).
@@ -170,10 +177,72 @@ with command line arguments, however it is better to specify all of the paramete
The original (legacy) version used a custom Slurm plugin for orchestrating the EC2 compute nodes.
4
+
The latest version uses ParallelCluster to provision the core Slurm infrastructure.
5
+
When using ParallelCluster, a ParallelCluster configuration will be generated and used to create a ParallelCluster Slurm cluster.
6
+
The first supported ParallelCluster version is 3.6.0.
7
+
Version 3.7.0 is the recommended minimum version because it supports compute node weighting that is proportional to instance type
8
+
cost so that the least expensive instance types that meet job requirements are used.
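
As a rough illustration of cost-proportional weighting, the sketch below turns hourly prices into integer node weights. Slurm prefers nodes with the lowest weight that satisfy a job's requirements, so cheaper instance types are selected first. The function name, the scale factor, and the prices are assumptions made for this example; they are not the values or the code used by this repository.

```python
def node_weights(hourly_prices, scale=1000):
    """Map each instance type to an integer weight proportional to its price.

    The scale factor (an assumption) keeps sub-dollar price differences
    visible after rounding. Weights are clamped to a minimum of 1.
    """
    return {
        name: max(1, round(price * scale))
        for name, price in hourly_prices.items()
    }


# Hypothetical hourly prices in USD; not real AWS pricing.
prices = {"type-a": 0.10, "type-b": 0.25, "type-c": 0.05}

weights = node_weights(prices)
# Lowest weight first, which is the order Slurm would prefer.
preferred_order = sorted(weights, key=weights.get)
print(weights)
print(preferred_order)
```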
9
+
10
+
## Prerequisites
11
+
12
+
See [Deployment Prerequisites](deployment-prerequisites.md) page.
13
+
14
+
The following are prerequisites that are specific to ParallelCluster.
15
+
16
+
### Create ParallelCluster UI (optional but recommended)
17
+
18
+
It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
19
+
A different UI is required for each version of ParallelCluster that you are using.
20
+
The versions are listed in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
21
+
The minimum required version is 3.6.0, which adds support for RHEL 8 and increases the number of allowed queues and compute resources.
22
+
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
23
+
compute nodes by their cost.
24
+
25
+
The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).
26
+
27
+
### Create ParallelCluster Slurm Database
28
+
29
+
The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
30
+
If you need these and other features then you will need to create a ParallelCluster Slurm Database.
31
+
You do not need to create a new database for each cluster; multiple clusters can share the same database.
32
+
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).
33
+
34
+
## Create the Cluster
35
+
36
+
To install the cluster, run the install script. You can override some parameters in the config file
37
+
with command line arguments; however, it is better to specify all of the parameters in the config file.
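
The precedence described above can be sketched as a simple merge in which a command-line value replaces the config file value only when it is actually supplied. This is a minimal sketch of the general pattern, not the install script's implementation; the function name and the parameter keys (`cluster_name`, `region`) are hypothetical.

```python
def merge_config(file_config, cli_overrides):
    """Return the effective settings: config file values, with any
    command-line value that was actually given taking precedence."""
    merged = dict(file_config)
    for key, value in cli_overrides.items():
        if value is not None:  # None means the argument was not passed
            merged[key] = value
    return merged


# Hypothetical parameter names and values, for illustration only.
file_config = {"cluster_name": "eda-cluster-a", "region": "us-east-1"}
cli_overrides = {"cluster_name": None, "region": "us-west-2"}

effective = merge_config(file_config, cli_overrides)
print(effective)
```

Keeping every parameter in the config file means the file alone records how the cluster was built, which is harder to guarantee when values arrive on the command line.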