
Add NIXL backend #6016

Draft
x41lakazam wants to merge 10 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit f8a94fc

Description

  • Add NIXL backend implementation for GPU tensor transfers using UCX

  • Implement memory registration, metadata exchange, and transfer operations

  • Add comprehensive test suite for NIXL backend functionality

  • Integrate NIXL build support into CMake and Python build system

Changes walkthrough

Relevant files

Enhancement (5 files)
  • nixl.cpp: Complete NIXL backend implementation with UCX integration (+522/-0)
  • nixl.h: NIXL backend header with API definitions and tensor utilities (+232/-0)
  • communicator.h: Add NIXL backend availability check (+12/-1)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.cpp: Add NIXL case to communicator backend output (+3/-0)

Tests (1 file)
  • test_multidevice_nixl.cpp: Comprehensive test suite for NIXL backend functionality (+289/-0)

Configuration changes (2 files)
  • CMakeLists.txt: Add NIXL build configuration and test integration (+39/-0)
  • utils.py: Add NIXL build configuration support (+4/-0)

Documentation (1 file)
  • setup.py: Document NIXL build environment variable (+3/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Resource Management

The NixlTransferHandleImpl destructor is defaulted, but the NIXL transfer handle (xfer_handle) may need explicit cleanup. Verify that the handle is released properly and check for potential memory leaks.

class NixlTransferHandleImpl {
 public:
#ifdef USE_NIXL
  // TODO - is it leaking when handleimpl is destroyed ? 
  nixlXferReqH* xfer_handle = nullptr;
#endif
  bool prepared = false;
  bool posted = false;
};
Spin Loop Performance

The waitTransfer method uses a busy-wait spin loop (lines 443-449) which may consume excessive CPU cycles. Consider adding a small sleep or using a more efficient waiting mechanism.

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
#else
  (void)handle;
  NVF_THROW("NIXL support not compiled (USE_NIXL not defined)");
#endif
}
Error Handling Robustness

The UCX backend probe mechanism (lines 166-217) silently marks the backend as unavailable on failures. Consider adding more detailed logging or error reporting to help users understand why NIXL backend initialization failed.

{
  constexpr int64_t kProbeBytes = 64;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator_.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  std::cerr << "[NixlBackend probe] device=" << dev_idx
            << " addr=0x" << std::hex << addr << std::dec
            << " nbytes=" << nbytes
            << " numel=" << probe.numel()
            << " element_size=" << probe.element_size() << std::endl;

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  std::cerr << "[NixlBackend probe] reg_dlist desc: addr=0x" << std::hex
            << reg_dlist[0].addr << std::dec
            << " len=" << reg_dlist[0].len
            << " devId=" << reg_dlist[0].devId << std::endl;

  nixl_status_t reg_status = agent_->registerMem(reg_dlist);
  std::cerr << "[NixlBackend probe] registerMem returned "
            << reg_status << std::endl;
  if (reg_status != NIXL_SUCCESS) {
    return;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);
  std::cerr << "[NixlBackend probe] prepXferDlist returned "
            << prep_status << std::endl;

  if (dlist_handle) {
    agent_->releasedDlistH(dlist_handle);
  }
  agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return;
  }
}

@samnordmann left a comment
Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

Comment on lines +123 to +127
#ifdef USE_NIXL
return true;
#else
return false;
#endif

The logic LGTM, but for consistency in the implementation please mimic what is done with nccl_available_ and ucc_available_.

};

// ------------------------------------------------------------------
// Todo - those functions should be moved to a more global file

These helper functions are only used in the csrc/multidevice/nixl.cpp file, so let's not move their declaration up for now; I would even put those definitions in the .cpp file, not the header. By the way, our convention is to put static definitions (definitions only used in their own file, not linked against) inside an anonymous namespace:

namespace {

void helper() {
...
}

} // namespace

void exportedFunction() {
   ...
   helper();
   ...
}

TensorDesc, though, might be needed in the header (IIUC we cannot communicate an at::Tensor outside the NVLink domain, otherwise providing at::Tensor::device crashes). Imo we can leave it here for now; later we could move it up to ipc_utils.h or multidevice.h if we feel the need.

};
}

inline at::Tensor fromTensorDesc(const TensorDesc& desc) {

Not used.

);
}

inline std::vector<uint8_t> serializeTensorsDescs(

Instead of implementing serialize/deserialize, could you instead use functions like these:

template <typename T>
std::vector<uint8_t> toBytes(const T& data) {
  return std::vector<uint8_t>(
      reinterpret_cast<const uint8_t*>(&data),
      reinterpret_cast<const uint8_t*>(&data) + sizeof(T));
}

template <typename T>
const T& fromBytes(const std::vector<uint8_t>& bytes) {
  return *reinterpret_cast<const T*>(bytes.data());
}

uint32_t dev;
};
static_assert(std::is_trivially_copyable_v<TensorDesc>,
"TensorDesc must be trivially copyable for serialization");

I don't think that's needed

}

void NixlBackend::cleanup() {
cleaned_up_ = true;

Maybe reuse available_ here?


void NixlBackend::cleanup() {
cleaned_up_ = true;
impl_.reset();

Where is it defined?

};

NixlTransferHandle::NixlTransferHandle() = default;
NixlTransferHandle::~NixlTransferHandle() = default;

Memory leak?

NVF_THROW("Failed to create UCX backend for NIXL agent");
}

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with

Is that really necessary? It looks suspicious to me; can you help me understand?

#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally cuda-graph capturable?

@samnordmann

@x41lakazam Can you provide instructions on how to build NIXL, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

@samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029
