
Failing PyTorch Collective Operations on Non-Homogeneous Ring (NVIDIA and AMD GPUs) #1045

Closed
RafalSiwek opened this issue Nov 1, 2024 · 3 comments


@RafalSiwek

Hi UCC Team,

Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: g4ad.xlarge (AMD ROCm) and g4dn.xlarge (NVIDIA CUDA) instances.

(All relevant code, log outputs, and observations for additional context are available here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Summary of Setup

Software and Hardware Configuration

  • ROCm (AMD): g4ad.xlarge instance with AMD Radeon Pro V520 (RDNA1 architecture, gfx1011 shader) and ROCm 6.2.2.
  • CUDA (NVIDIA): g4dn.xlarge instance with NVIDIA T4 GPU (Turing architecture) and CUDA 12.4.
  • MPI: Built with UCC and UCX as the transport layer.
  • Environment: Containers for each GPU type, configured with UCX and UCC builds (see the paragraph describing the Dockerfiles, installation scripts, and config outputs for UCC and UCX here).

Observed Behavior

  1. Bi-Directional Communication (test code here): A bidirectional send_recv test in PyTorch ran successfully, confirming basic point-to-point communication across the GPUs (logs available here).

  2. Allreduce Operation (test code here): The allreduce test failed with an error raised from ucp_tag_send_nbx; a minimal sketch of both tests is shown below.
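
For context, here is a minimal sketch of the kind of test I am running (the names, shapes, and device handling are illustrative rather than the exact PoC code). It assumes one process per node, a PyTorch build with MPI support so that backend="mpi" goes through OMPI + UCC + UCX, and device-memory-aware transports on both sides:

```python
# Minimal sketch, not the exact PoC code: one rank on the CUDA node, one on the ROCm node.
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # OMPI + UCC + UCX underneath
rank = dist.get_rank()
peer = 1 - rank  # exactly two ranks in this setup
device = torch.device("cuda", 0)  # ROCm builds of PyTorch expose the same "cuda" device API

# 1. Bi-directional send_recv: this part works.
send_buf = torch.full((4,), float(rank), device=device)
recv_buf = torch.empty(4, device=device)
if rank == 0:
    dist.send(send_buf, dst=peer)
    dist.recv(recv_buf, src=peer)
else:
    dist.recv(recv_buf, src=peer)
    dist.send(send_buf, dst=peer)

# 2. Allreduce: this is where the ucp_tag_send_nbx failure shows up.
x = torch.ones(1024, device=device) * (rank + 1)
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: allreduce result {x[0].item()}")

dist.destroy_process_group()
```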

Request for Technical Insights

To better understand and resolve the failure in ucp_tag_send_nbx during the allreduce operation, I would appreciate guidance on the following:

  1. Potential Causes of ucp_tag_send_nbx Failures: Could you provide technical insights into why the ucp_tag_send_nbx operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of specific communication or memory operations that might impact ucp_tag_send_nbx in heterogeneous setups would be helpful.

  2. Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behaviors in this setup, I would be glad to run further tests; an example of what I have in mind is sketched below.
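
For instance, would something along these lines be a reasonable starting point? The particular variables below are only my guess at useful knobs; they are set before MPI/UCX is initialized in the process, but could equally be exported in the job environment:

```python
# Guessed diagnostic settings, applied before torch.distributed initializes MPI/UCX.
import os

os.environ.setdefault("UCX_LOG_LEVEL", "debug")  # verbose UCX transport/protocol logging
os.environ.setdefault("UCC_LOG_LEVEL", "debug")  # verbose UCC collective logging
# Optionally pin the transport list to rule out selection issues, e.g.:
# os.environ.setdefault("UCX_TLS", "tcp,cuda_copy,rocm_copy")

import torch.distributed as dist
dist.init_process_group(backend="mpi")
```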

Thank you for your continued support and assistance with this project!

@edgargabriel
Contributor

edgargabriel commented Nov 1, 2024

@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collective without UCC can work (at least not for reductions): you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration.
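
To make sure that path is actually the one being exercised, something along these lines should help (the exact variable names are from memory, so please verify them against your Open MPI and UCC builds); they can be set in the Python process before initialization or exported in the shell / passed to mpirun with -x:

```python
# From-memory sketch: enable the Open MPI coll/ucc component and restrict UCC to the ucp TL.
import os

os.environ["OMPI_MCA_coll_ucc_enable"] = "1"      # enable the UCC collectives component
os.environ["OMPI_MCA_coll_ucc_priority"] = "100"  # prefer it over other coll components
os.environ["UCC_CL_BASIC_TLS"] = "ucp"            # restrict UCC's basic CL to tl/ucp

import torch.distributed as dist
dist.init_process_group(backend="mpi")  # MCA/UCC variables are read at MPI init
```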

Regarding the ucp_tag_send_nbx failure: I am not entirely sure, since the simple send-recv test worked. I would recommend trying to run something like the osu_latency or osu_bw benchmark across the two nodes; that would probably stress the system/UCX a bit more than just a single message in both directions. If the osu_latency benchmark using device memory works, there is a reasonable chance that the UCX side of the software stack is working.

@RafalSiwek
Author

RafalSiwek commented Nov 1, 2024

Thanks @edgargabriel for the feedback and for clarifying the MPI collective aspect. I'll keep that in mind and focus on UCC with tl/ucp as recommended.

Following up on your suggestion, I ran a set of OSU Micro-Benchmarks 7.4 tests to get more insight into OMPI+UCC+UCX performance. Here are the results:

  • P2P Bi-Directional Bandwidth Benchmark - Ran successfully with all data types, showing expected bandwidth limitations (logs here).
  • Collective AllGather Latency Benchmark - No issues; completed as expected, though with a noticeable impact on latency (logs here).
  • Collective AllReduce Latency Benchmark - With validation off, it completed without issues (logs here). With validation on, it failed for all data types, showing notable discrepancies between expected and actual values (logs here). Interestingly, in version 7.2 of the OSU benchmark suite (which uses the older validation approach), validation passed without issue (logs here).

The benchmark results show that collective communication across these GPUs is functional but does hit some performance and validation limitations. Given that, I'm curious whether the ucp_tag_send_nbx failure in PyTorch could stem from how PyTorch handles collectives. PyTorch's allreduce implementation is similar to what I'm using in my allreduce test, but maybe there's a key difference in configuration, or a system-stress factor, that I'm missing.

Based on these results, I'm wondering if you might have any insights into what could be causing the ucp_tag_send_nbx error specifically during collective operations with PyTorch, or which community would be the right place to ask for guidance on this issue. Given that basic send-receive communication works fine and that the custom allreduce runs successfully, it's puzzling why this failure arises with PyTorch collectives. Thank you again for your time and help with this!

@RafalSiwek
Author

The issue appears to be related to the UCX configuration: I had built UCX with multi-threading enabled, which this setup might not fully support. After rebuilding UCX with multi-threading disabled, everything seems to be working correctly.

Thank you very much for your support!
