Description
Feature request
AWS ParallelCluster version 3.1.2
AWS Batch Scheduler
If I understand correctly, OpenMPI processes on the same node will use the shared memory BTL ('vader' or 'sm') by default.
We've found that Docker's default container shared memory size of 64 MB is not enough for the OpenMPI shared memory BTL when we run our model with a high process count on larger instances such as *.18xlarge and *.24xlarge.
The shortage causes our model to fail with errors stating: Program received signal SIGBUS: Access to an undefined portion of a memory object.
Increasing the size of shared memory for the container(s) by manually updating the AWS::Batch::JobDefinition in the ParallelCluster stack fixes the issue.
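For reference, here is a rough sketch of the manual change, assuming the job definition is edited through its CloudFormation template. The logical ID `ExampleJobDefinition` is a placeholder, not the name ParallelCluster actually generates, and all other required properties are omitted. `LinuxParameters.SharedMemorySize` is the AWS Batch container property that sets the size of `/dev/shm` in MiB.

```yaml
# Fragment only: shows where the shared memory setting lives in the resource.
# "ExampleJobDefinition" is a hypothetical logical ID used for illustration.
Resources:
  ExampleJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        # Image, resource requirements, etc. stay as generated by the stack.
        LinuxParameters:
          SharedMemorySize: 1024  # size of /dev/shm in MiB (Docker default is 64)
```

The value 1024 matches the one we would like to pass through the cluster configuration; the right size probably depends on the process count and message sizes.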
It would be great if ParallelCluster allowed us to configure the shared memory size for the container(s) so we can use larger instances easily, like:
Scheduling:
  Scheduler: awsbatch
  AwsBatchQueues:
    - Name: my-queue
      ComputeResources:
        - Name: my-compute-resource
          InstanceTypes:
            - c5.18xlarge
          MinvCpus: 0
          DesiredvCpus: 0
          MaxvCpus: 360
          SharedMemorySize: 1024
I'm happy to submit a PR if this seems like a simple addition.
Please pardon my ignorance if I'm not understanding the issue correctly, or if there is a better approach.
Helpful blog -> using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch