Switch to accelerate for Multi GPU #87
If we consider the constrained problem of running GPU jobs within a single pod, where each GPU is handled by a single process, there are the following options for running multiple processes:
There is also a third option where the processes are distributed across multiple Kubernetes pods, but this may be overly complex; it would be the standard Kubeflow training operator approach. Hugging Face's recommendation is to run distributed training jobs using accelerate.
Option 1:
Option 2:
The "PyTorchJob" operator/CR from standard Kubeflow training operator allows us to run multiple processes within single container in a pod (Master pod) like the option 1 when we just want to run a multi-gpu single node training job. When we wish to spawn multi-node multi-gpu job, then we would leverage the worker pod where distributed environment variables (node rank, master address, port etc) are automatically injected by the operator. We just simply replicate accelerate launch in the worker pod and the node rank from the operator determines whether the pod is a worker pod or not. Also there is local rank created by torch.distributed which differentiates between all the processes. In option 1, AFAIK, most of the popular container runtimes are multiprocess friendly and on the resource side, the resource requests and limits are container level in Kubernetes. |
Thanks for the findings. We are working on the Kubernetes solution in the platform team next week.
We will be testing it with the Kubeflow training operator and will update when the work is done as part of issue #88.
Is your feature request related to a problem? Please describe.
Accelerate (https://huggingface.co/docs/transformers/en/accelerate) was created by Hugging Face to help users easily train a Transformers model on any type of distributed setup, whether it is multiple GPUs on one machine or multiple GPUs across several machines. We should leverage the library for its ease of use.
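To illustrate the ease of use, here is a minimal sketch of an Accelerate-style training loop; the model, optimizer, and dataloader names are placeholders rather than this project's code. The same script runs on a single GPU or across many when started with `accelerate launch`:

```python
# Minimal sketch of an Accelerate training loop (placeholder names, not the
# project's actual trainer). The Accelerator picks up the distributed setup
# created by `accelerate launch`.
from accelerate import Accelerator


def train(model, optimizer, dataloader, num_epochs: int = 1):
    accelerator = Accelerator()  # detects single-GPU vs multi-GPU from the launcher
    # Wraps the model (e.g. in DDP), shards the dataloader across processes, and
    # moves everything to the right device for this process.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)  # replaces loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

Launched with, for example, `accelerate launch --num_processes 8 train.py`, the same unmodified script scales from one GPU to eight.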
Describe the solution you'd like
Describe alternatives you've considered
torchrun, as used currently, can be less user-friendly. See #80
(@fabianlim thanks for the suggestions)