awsbatch: add AWS::Batch::JobDefinition SharedMemorySize

**Feature request**

**AWS ParallelCluster version 3.1.2**

**AWS Batch Scheduler**

If I understand correctly, OpenMPI processes on the same node will use the [shared memory BTL ('vader' or 'sm')](https://docs.open-mpi.org/en/v5.0.x/networking/shared-memory.html) by default.

We've found that the [Docker container default shared memory size of 64 megabytes](https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources) is not enough for the OpenMPI shared memory BTL when we run our model with a higher number of processes on larger instances such as `*.18xlarge` and `*.24xlarge`.

The shortage causes our model to fail with errors stating: `Program received signal SIGBUS: Access to an undefined portion of a memory object.`

Increasing the size of shared memory for the container(s) by manually updating the [AWS::Batch::JobDefinition](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-batch-jobdefinition-containerproperties-linuxparameters.html#cfn-batch-jobdefinition-containerproperties-linuxparameters-sharedmemorysize) in the ParallelCluster stack fixes the issue.
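
For reference, the manual change described above amounts to roughly the following CloudFormation fragment. This is a minimal sketch, not the template ParallelCluster actually generates: the logical ID, image, and resource values are hypothetical placeholders, and the only part that matters here is the `LinuxParameters.SharedMemorySize` property, which takes an integer value in MiB.

```yaml
Resources:
  ExampleJobDefinition:              # hypothetical logical ID, not the name used in the ParallelCluster stack
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Image: example-image:latest  # placeholder image
        ResourceRequirements:        # placeholder sizing
          - Type: VCPU
            Value: "2"
          - Type: MEMORY
            Value: "4096"
        LinuxParameters:
          SharedMemorySize: 1024     # size of /dev/shm in MiB (Docker default is 64)
```

For a multi-node parallel job definition, the same `LinuxParameters` block would sit under `NodeProperties.NodeRangeProperties[].Container` rather than under a top-level `ContainerProperties`.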

It would be great if ParallelCluster allowed us to configure the shared memory size for the container(s), so that we can use larger instances without manually editing the stack. For example:

```
Scheduling:
  Scheduler: awsbatch
  AwsBatchQueues:
  - Name: my-queue
    ComputeResources:
    - Name: my-compute-resource
      InstanceTypes:
      - c5.18xlarge
      MinvCpus: 0
      DesiredvCpus: 0
      MaxvCpus: 360
      SharedMemorySize: 1024
```

I'm happy to submit a PR if this seems like a simple addition.

Please pardon my ignorance if I'm not understanding the issue correctly, or if there is a better approach.

Helpful blog post: [using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch](https://aws.amazon.com/blogs/compute/using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch/)

awsbatch: add AWS::Batch::JobDefinition SharedMemorySize #4261
