basic CUDA <> CPU or CUDA <> CUDA rdma Support #372

dstaay-fb · 2025-06-27T20:25:49Z

Summary:
RDMA support for CUDA <> CUDA and CUDA <> CPU comms

Key changes

- Using CUDA APIs, we can detect whether a given pointer is mapped to a CUDA device or to the CPU.
- If the data pointer is CUDA, the code uses DMA registration to register the memory with the NIC; by using cuMemGetHandleForAddressRange, we avoid passing CUDA allocation handles around directly.
- If the data pointer is CPU, we use a standard ibv MR. Note that I transitioned to standard registration rather than registering the entire memory space (security concern raised by mariusae).
- Refactored the test infra to support named NIC devices and different compute targets (cuda:X or cpu).
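
The dispatch described in the first three items can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `RegistrationPlan` and `plan_registration` are hypothetical names, and the `query` argument stands in for the result of a driver call such as `cuPointerGetAttribute` with `CU_POINTER_ATTRIBUTE_MEMORY_TYPE`, so the sketch runs without CUDA or a NIC. Treating a pointer the driver does not recognize as plain CPU memory is an assumption of the sketch.

```rust
// Values of CUDA's `CUmemorytype` enum, as reported by the driver for
// CU_POINTER_ATTRIBUTE_MEMORY_TYPE.
const CU_MEMORYTYPE_HOST: u32 = 1;
const CU_MEMORYTYPE_DEVICE: u32 = 2;

/// Hypothetical description of how a buffer would be registered with the NIC.
#[derive(Debug, PartialEq, Eq)]
enum RegistrationPlan {
    /// Device memory: obtain a handle for exactly this address range
    /// (cuMemGetHandleForAddressRange) and register that with the NIC.
    CudaRange { addr: usize, len: usize },
    /// Host memory: a standard ibv_reg_mr over exactly this range,
    /// not over the entire address space.
    HostRange { addr: usize, len: usize },
}

/// `query` stands in for the outcome of asking the CUDA driver about `addr`:
/// Ok(memory type) on success, Err(CUresult code) on failure.
fn plan_registration(
    query: Result<u32, u32>,
    addr: usize,
    len: usize,
) -> Result<RegistrationPlan, String> {
    if len == 0 {
        return Err("cannot register an empty buffer".to_string());
    }
    match query {
        Ok(CU_MEMORYTYPE_DEVICE) => Ok(RegistrationPlan::CudaRange { addr, len }),
        // Assumption: memory reported as host, or not known to the driver
        // at all, is ordinary CPU memory.
        Ok(CU_MEMORYTYPE_HOST) | Ok(0) | Err(_) => {
            Ok(RegistrationPlan::HostRange { addr, len })
        }
        // Other memory types (for example arrays or unified memory) are
        // left unsupported in this sketch.
        Ok(other) => Err(format!("unsupported CUDA memory type {}", other)),
    }
}
```

For example, `plan_registration(Ok(CU_MEMORYTYPE_DEVICE), addr, len)` selects the CUDA path, while an `Err(..)` from the query falls back to the host path. Keeping the decision separate from the FFI calls would also make it easy to unit test, which may help with the cuda/cuda test listed under To Do.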

This implementation is relatively naive, and I will iterate accordingly.

To Do: add unit test for cuda/cuda

Differential Revision: D77408653

facebook-github-bot · 2025-06-27T20:26:09Z

This pull request was exported from Phabricator. Differential Revision: D77408653

dstaay-fb added 3 commits June 27, 2025 13:25

cuda-sys generated bindings

befebad

Summary: exposes basic cuda bindings to monarch for rdma support Differential Revision: D77404103

Create custom RdmaCore-sys bindings

68947b6

Summary: expose rdmacore bindings, including basic ibv verbs along with mlx5dv providers, to monarch for rdma. Differential Revision: D77408652

facebook-github-bot added the CLA Signed label Jun 27, 2025

facebook-github-bot added the fb-exported label Jun 27, 2025
