Add accelerator API to RPC distributed examples: ddp_rpc, parameter_server, rnn #1371

Open · wants to merge 4 commits into main
Conversation

jafraustro (Contributor)

Add accelerator API to RPC distributed examples:

  • ddp_rpc
  • parameter_server
  • rnn
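
For context, a minimal sketch of the device-selection pattern these examples move to, assuming PyTorch 2.6+ where the torch.accelerator API is available; the fallback branch and variable names are illustrative, not the exact diff:

```python
# Hedged sketch: select the current accelerator (CUDA, XPU, MPS, ...) when one
# is visible, otherwise fall back to CPU. Assumes torch.accelerator exists
# (PyTorch >= 2.6); the model here is a stand-in, not code from the examples.
import torch

if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

model = torch.nn.Linear(16, 8).to(device)
print(f"Using device: {device}")
```

The point of the pattern is that one code path replaces backend-specific checks such as torch.cuda.is_available(), so the same example runs on any supported accelerator.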

CC: @soumith


Signed-off-by: jafraustro <[email protected]>
netlify bot commented Jul 14, 2025

Deploy Preview for pytorch-examples-preview canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | a84f91c |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-examples-preview/deploys/68798280b39c080008fc743c |

jafraustro marked this pull request as ready for review July 14, 2025 16:34
soumith (Member) commented Jul 15, 2025

failing CI

jafraustro (Contributor, Author)

I added numpy to the requirements.txt files.

jafraustro closed this Jul 15, 2025
jafraustro reopened this Jul 15, 2025
soumith (Member) commented Jul 16, 2025

still failing :D

- Added a function to verify minimum GPU count before execution (sketched below).
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.

Signed-off-by: jafraustro <[email protected]>
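
A minimal sketch of the GPU-count guard this commit describes, assuming the accelerator API; verify_min_gpu_count matches the name in the commit message, but the threshold and exit message are illustrative:

```python
# Hedged sketch of the minimum-GPU-count guard; the exact body in the PR may differ.
import sys

import torch

def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    """Return True when at least `min_gpus` accelerator devices are visible."""
    return torch.accelerator.is_available() and torch.accelerator.device_count() >= min_gpus

if __name__ == "__main__":
    min_gpus = 2  # the DDP step below needs two devices
    if not verify_min_gpu_count(min_gpus):
        print(f"This example requires at least {min_gpus} GPUs; exiting.")
        sys.exit(0)
```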
jafraustro (Contributor, Author)

Hi @soumith,

The DDP step needs two GPUs.

Fix:

  • Added a verify_min_gpu_count() function to check for sufficient GPU resources.
  • Updated the HybridModel class to use rank-based device assignment instead of generic device handling, improving device placement consistency across distributed processes.
  • Implemented proper cleanup by adding dist.destroy_process_group() calls for trainer processes (see the sketch below).
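
A hedged sketch of the rank-based placement and cleanup described above; HybridModel is reduced to its local dense part (the real example also holds a remote EmbeddingBag via RPC), and the layer sizes, backend choice, and run_trainer helper are illustrative:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class HybridModel(torch.nn.Module):
    def __init__(self, rank: int):
        super().__init__()
        # Rank-based device assignment: each trainer pins its dense layer to the
        # device matching its rank, keeping placement consistent across processes.
        acc = torch.accelerator.current_accelerator()
        assert acc is not None, "this sketch assumes an accelerator is present"
        self.device = torch.device(acc.type, rank)
        self.fc = DDP(torch.nn.Linear(16, 8).to(self.device), device_ids=[rank])

    def forward(self, x):
        return self.fc(x.to(self.device))

def run_trainer(rank: int, world_size: int):
    # Assumes MASTER_ADDR/MASTER_PORT are already set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = HybridModel(rank)
    # ... training loop ...
    # Explicit teardown so trainer processes do not leak process-group resources.
    dist.destroy_process_group()
```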
