
Add NIXL backend #6016

Draft
x41lakazam wants to merge 10 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit f8a94fc

Description

  • Add NIXL backend implementation for GPU tensor transfers using UCX

  • Implement memory registration, metadata exchange, and transfer operations

  • Add comprehensive test suite for NIXL backend functionality

  • Integrate NIXL build support into CMake and Python build system

Changes walkthrough

Relevant files

Enhancement (5 files)
  • nixl.cpp: Complete NIXL backend implementation with UCX integration (+522/-0)
  • nixl.h: NIXL backend header with API definitions and tensor utilities (+232/-0)
  • communicator.h: Add NIXL backend availability check (+12/-1)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.cpp: Add NIXL case to communicator backend output (+3/-0)

Tests (1 file)
  • test_multidevice_nixl.cpp: Comprehensive test suite for NIXL backend functionality (+289/-0)

Configuration changes (2 files)
  • CMakeLists.txt: Add NIXL build configuration and test integration (+39/-0)
  • utils.py: Add NIXL build configuration support (+4/-0)

Documentation (1 file)
  • setup.py: Document NIXL build environment variable (+3/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Resource Management

The NixlTransferHandleImpl destructor is defaulted, but the NIXL transfer handle (xfer_handle) may need explicit cleanup. Verify that the handle is released properly and check for potential memory leaks.

class NixlTransferHandleImpl {
 public:
#ifdef USE_NIXL
  // TODO - is it leaking when handleimpl is destroyed ? 
  nixlXferReqH* xfer_handle = nullptr;
#endif
  bool prepared = false;
  bool posted = false;
};
Spin Loop Performance

The waitTransfer method uses a busy-wait spin loop (lines 443-449) which may consume excessive CPU cycles. Consider adding a small sleep or using a more efficient waiting mechanism.

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
#else
  (void)handle;
  NVF_THROW("NIXL support not compiled (USE_NIXL not defined)");
#endif
}
Error Handling Robustness

The UCX backend probe mechanism (lines 166-217) silently marks the backend as unavailable on failures. Consider adding more detailed logging or error reporting to help users understand why NIXL backend initialization failed.

{
  constexpr int64_t kProbeBytes = 64;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator_.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  std::cerr << "[NixlBackend probe] device=" << dev_idx
            << " addr=0x" << std::hex << addr << std::dec
            << " nbytes=" << nbytes
            << " numel=" << probe.numel()
            << " element_size=" << probe.element_size() << std::endl;

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  std::cerr << "[NixlBackend probe] reg_dlist desc: addr=0x" << std::hex
            << reg_dlist[0].addr << std::dec
            << " len=" << reg_dlist[0].len
            << " devId=" << reg_dlist[0].devId << std::endl;

  nixl_status_t reg_status = agent_->registerMem(reg_dlist);
  std::cerr << "[NixlBackend probe] registerMem returned "
            << reg_status << std::endl;
  if (reg_status != NIXL_SUCCESS) {
    return;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);
  std::cerr << "[NixlBackend probe] prepXferDlist returned "
            << prep_status << std::endl;

  if (dlist_handle) {
    agent_->releasedDlistH(dlist_handle);
  }
  agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return;
  }
}

@samnordmann left a comment
Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

Comment on lines +123 to +127
#ifdef USE_NIXL
return true;
#else
return false;
#endif

The logic LGTM, but for consistency in the implementation please mimic what is done with nccl_available_ and ucc_available_.

};

// ------------------------------------------------------------------
// Todo - those functions should be moved to a more global file

These helper functions are only used in the csrc/multidevice/nixl.cpp file, so let's not move their declaration up for now; I would even put those definitions in the .cpp file, not the header. By the way, our convention is to put static definitions (definitions only used in their own file, not linked against) inside an anonymous namespace:

namespace {

void helper() {
...
}

} // namespace

void exportedFunction() {
   ...
   helper();
   ...
}

TensorDesc, though, might be needed in the header (IIUC we cannot communicate an at::Tensor outside the NVLink domain, otherwise providing at::Tensor::device crashes). Imo we can leave it here for now; later we could move it up to ipc_utils.h or multidevice.h if we feel the need.

};
}

inline at::Tensor fromTensorDesc(const TensorDesc& desc) {

Not used.

);
}

inline std::vector<uint8_t> serializeTensorsDescs(

Instead of implementing serialize/deserialize, could you instead use functions like these:

template <typename T>
std::vector<uint8_t> toBytes(const T& data) {
  return std::vector<uint8_t>(
      reinterpret_cast<const uint8_t*>(&data),
      reinterpret_cast<const uint8_t*>(&data) + sizeof(T));
}

template <typename T>
const T& fromBytes(const std::vector<uint8_t>& bytes) {
  return *reinterpret_cast<const T*>(bytes.data());
}

uint32_t dev;
};
static_assert(std::is_trivially_copyable_v<TensorDesc>,
"TensorDesc must be trivially copyable for serialization");

I don't think that's needed

}

void NixlBackend::cleanup() {
cleaned_up_ = true;

Maybe reuse available_ here?


void NixlBackend::cleanup() {
cleaned_up_ = true;
impl_.reset();

Where is it defined?

};

NixlTransferHandle::NixlTransferHandle() = default;
NixlTransferHandle::~NixlTransferHandle() = default;

Memory leak?

NVF_THROW("Failed to create UCX backend for NIXL agent");
}

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with

Is that really necessary? It looks suspicious to me; can you help me understand?

#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally cuda-graph capturable?

@samnordmann

@x41lakazam Can you provide instructions on how to build NIXL, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

@samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029
