
Document how to configure shared memory for multi GPU deployments #5

Open
jsuchome opened this issue Mar 5, 2025 · 0 comments

The documentation states

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2

is a way to enable multi-GPU tensor parallelism. However, the tensor-parallel worker processes have to communicate with each other, and this usually requires a properly sized shared memory (/dev/shm) setup. If shared memory is not configured correctly, you can run into errors like:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.cpp:81, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Error while creating shared memory segment /dev/shm/nccl-vzIpS6 (size 9637888)

when running sglang server.

This means the available shared memory is too small for NCCL to create its segments.
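A quick way to confirm this is to check the size of /dev/shm inside the running container (Docker gives containers only 64 MiB by default):

# check the size and current usage of /dev/shm inside the container
df -h /dev/shm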

When running in Docker containers, this can be set with the --shm-size flag (see vLLM's docs at https://docs.vllm.ai/en/latest/deployment/docker.html).
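For example, a minimal sketch (the image tag, port, shared memory size, and model here are illustrative placeholders, not an official recommendation):

# give the container 16 GiB of shared memory instead of the 64 MiB default
docker run --gpus all --shm-size=16g -p 30000:30000 lmsysorg/sglang:latest \
    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --tp 2 --host 0.0.0.0 --port 30000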

When running in Kubernetes, the default shared memory size may not be enough for your containers either, so you may need to configure a larger one. A common way to do this is to mount /dev/shm as an emptyDir volume with medium: Memory and a suitable sizeLimit, like this:

    spec:
      containers:
      - command:
        ... < your usual container setup > ...
        volumeMounts:
        - mountPath: /dev/shm
          name: shared
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: shared

I have found that the vLLM project recommends 20Gi as the default shared memory size, see vllm-project/production-stack#44 and their Helm chart value https://github.com/vllm-project/production-stack/pull/105/files#diff-7d931e53fe7db67b34609c58ca5e5e2788002e7f99657cc2879c7957112dd908R130

However, I'm not sure where that number comes from. I was testing on a node with 2 NVIDIA L40 GPUs running the DeepSeek-R1-Distill-Qwen-32B model, and 1Gi of shared memory seemed to be enough.
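One way to pick a value empirically is to watch actual /dev/shm usage while the server is under load, and to rerun with NCCL_DEBUG=INFO (as the error message itself suggests) to see what NCCL allocates:

# inside the running container: watch shared memory usage while requests are served
watch -n 5 df -h /dev/shm

# rerun the server with NCCL debug logging to inspect shared memory allocations
NCCL_DEBUG=INFO python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2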
