awsbatch: add AWS::Batch::JobDefinition SharedMemorySize #4261

@tportwood

Feature request

AWS ParallelCluster version 3.1.2

AWS Batch Scheduler

If I understand correctly, OpenMPI processes on the same node will use the shared memory BTL ('vader' or 'sm') by default.

We've found that the Docker container default shared memory size (/dev/shm) of 64 MB is not enough for the OpenMPI shared memory BTL when we run our model with a higher number of processes on larger instances such as *.18xlarge and *.24xlarge.
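For anyone wanting to confirm the limit from inside a running container, here is a minimal sketch that reports the size of the shared memory mount. It assumes the mount is at /dev/shm (the usual location on Linux); the helper name `shm_total_mib` is hypothetical and only for illustration.

```python
import os
import shutil


def shm_total_mib(path="/dev/shm"):
    """Return the total size of the filesystem at `path` in MiB.

    Assumption: shared memory is mounted at /dev/shm, as is usual on Linux.
    """
    return shutil.disk_usage(path).total // (1024 * 1024)


if __name__ == "__main__":
    # Only report when the mount actually exists (e.g. skip on non-Linux hosts).
    if os.path.isdir("/dev/shm"):
        print(f"/dev/shm total: {shm_total_mib()} MiB")
```

In a container started with Docker's defaults, I would expect this to report roughly 64 MiB.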

The shortage causes our model to fail with errors stating: Program received signal SIGBUS: Access to an undefined portion of a memory object.

Increasing the size of shared memory for the container(s) by manually updating the AWS::Batch::JobDefinition in the ParallelCluster stack fixes the issue.
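To illustrate the manual change, here is a rough sketch of the container properties with the shared memory size set. In CloudFormation the property is ContainerProperties.LinuxParameters.SharedMemorySize (in MiB); the sketch uses the equivalent AWS Batch API field names (`linuxParameters.sharedMemorySize`). The helper name, image name, and resource values are hypothetical and not taken from the ParallelCluster stack.

```python
import copy


def with_shared_memory(container_properties, size_mib):
    """Return a copy of Batch container properties with /dev/shm set to size_mib MiB."""
    if size_mib <= 0:
        raise ValueError("size_mib must be a positive number of MiB")
    props = copy.deepcopy(container_properties)
    # Preserve any other linuxParameters that are already present.
    props.setdefault("linuxParameters", {})["sharedMemorySize"] = size_mib
    return props


# Hypothetical base properties, for illustration only.
base = {
    "image": "example/mpi-model:latest",
    "resourceRequirements": [
        {"type": "VCPU", "value": "72"},
        {"type": "MEMORY", "value": "131072"},
    ],
}

updated = with_shared_memory(base, 1024)
print(updated["linuxParameters"])
```

The resulting dictionary has the shape that boto3's `register_job_definition` accepts for its `containerProperties` argument, though I have only changed the job definition through the CloudFormation stack, not through the API.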

It would be great if ParallelCluster allowed us to configure the shared memory size for the container(s) so we can use larger instances easily, like:

Scheduling:
  Scheduler: awsbatch
  AwsBatchQueues:
  - Name: my-queue
    ComputeResources:
    - Name: my-compute-resource
      InstanceTypes:
      - c5.18xlarge
      MinvCpus: 0
      DesiredvCpus: 0
      MaxvCpus: 360
      SharedMemorySize: 1024

I'm happy to submit a PR if this seems like a simple addition.

Please pardon my ignorance if I'm not understanding the issue correctly, or if there is a better approach.

Helpful blog post: using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch

Labels: awsbatch (AWS Batch related issue or FE)
