
Failing PyTorch Collective Operations on Non-Homogeneous Ring (NVIDIA and AMD GPUs) #1045

Closed
RafalSiwek opened this issue Nov 1, 2024 · 3 comments


@RafalSiwek

Hi UCC Team,

Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: g4ad.xlarge (AMD ROCm) and g4dn.xlarge (NVIDIA CUDA) instances.

(All relevant code, log outputs, and observations for additional context are available here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Summary of Setup

Software and Hardware Configuration

  • ROCm (AMD): g4ad.xlarge instance with AMD Radeon Pro V520 (RDNA1 architecture, gfx1011 shader) and ROCm 6.2.2.
  • CUDA (NVIDIA): g4dn.xlarge instance with NVIDIA T4 GPU (Turing architecture) and CUDA 12.4.
  • MPI: Built with UCC and UCX as the transport layer.
  • Environment: Containers for each GPU type, configured with UCX and UCC builds (see the paragraph describing the Dockerfiles, installation scripts, and config outputs for UCC and UCX here).

Observed Behavior

  1. Bi-Directional Communication (test code here): A bidirectional send_recv test in PyTorch ran successfully, confirming basic point-to-point communication across the GPUs (logs available here).

  2. Allreduce Operation (test code here): The allreduce test failed with an error raised from ucp_tag_send_nbx; a minimal sketch of both tests is shown below.
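
For context, here is a minimal sketch of the kind of test I am running (the names, shapes, and device handling are illustrative rather than the exact PoC code). It assumes one process per node, a PyTorch build with MPI support so that backend="mpi" goes through OMPI + UCC + UCX, and device-memory-aware transports on both sides:

```python
# Minimal sketch, not the exact PoC code: one rank on the CUDA node, one on the ROCm node.
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # OMPI + UCC + UCX underneath
rank = dist.get_rank()
peer = 1 - rank  # exactly two ranks in this setup
device = torch.device("cuda", 0)  # ROCm builds of PyTorch expose the same "cuda" device API

# 1. Bi-directional send_recv: this part works.
send_buf = torch.full((4,), float(rank), device=device)
recv_buf = torch.empty(4, device=device)
if rank == 0:
    dist.send(send_buf, dst=peer)
    dist.recv(recv_buf, src=peer)
else:
    dist.recv(recv_buf, src=peer)
    dist.send(send_buf, dst=peer)

# 2. Allreduce: this is where the ucp_tag_send_nbx failure shows up.
x = torch.ones(1024, device=device) * (rank + 1)
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: allreduce result {x[0].item()}")

dist.destroy_process_group()
```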

Request for Technical Insights

To better understand and resolve the failure in ucp_tag_send_nbx during the allreduce operation, I would appreciate guidance on the following:

  1. Potential Causes of ucp_tag_send_nbx Failures: Could you provide technical insights into why the ucp_tag_send_nbx operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of specific communication or memory operations that might impact ucp_tag_send_nbx in heterogeneous setups would be helpful.

  2. Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behaviors in this setup, I would be glad to run further tests; an example of what I have in mind is sketched below.
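
For instance, would something along these lines be a reasonable starting point? The particular variables below are only my guess at useful knobs; they are set before MPI/UCX is initialized in the process, but could equally be exported in the job environment:

```python
# Guessed diagnostic settings, applied before torch.distributed initializes MPI/UCX.
import os

os.environ.setdefault("UCX_LOG_LEVEL", "debug")  # verbose UCX transport/protocol logging
os.environ.setdefault("UCC_LOG_LEVEL", "debug")  # verbose UCC collective logging
# Optionally pin the transport list to rule out selection issues, e.g.:
# os.environ.setdefault("UCX_TLS", "tcp,cuda_copy,rocm_copy")

import torch.distributed as dist
dist.init_process_group(backend="mpi")
```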

Thank you for your continued support and assistance with this project!

@edgargabriel
Contributor

edgargabriel commented Nov 1, 2024

@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collective without UCC can work (at least not for reductions): you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration.
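
To make sure that path is actually the one being exercised, something along these lines should help (the exact variable names are from memory, so please verify them against your Open MPI and UCC builds); they can be set in the Python process before initialization or exported in the shell / passed to mpirun with -x:

```python
# From-memory sketch: enable the Open MPI coll/ucc component and restrict UCC to the ucp TL.
import os

os.environ["OMPI_MCA_coll_ucc_enable"] = "1"      # enable the UCC collectives component
os.environ["OMPI_MCA_coll_ucc_priority"] = "100"  # prefer it over other coll components
os.environ["UCC_CL_BASIC_TLS"] = "ucp"            # restrict UCC's basic CL to tl/ucp

import torch.distributed as dist
dist.init_process_group(backend="mpi")  # MCA/UCC variables are read at MPI init
```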

Regarding the ucp_tag_send_nbx failure: I am not entirely sure, since the simple send-recv test worked. I would recommend trying to run something like the osu_latency or osu_bw benchmark across the two nodes; that would probably stress the system/UCX a bit more than just a single message in both directions. If the osu_latency benchmark using device memory works, there is a reasonable chance that the UCX side of the software stack is working.

@RafalSiwek
Author

RafalSiwek commented Nov 1, 2024

Thanks @edgargabriel for the feedback and for clarifying the MPI collective aspect. I'll keep that in mind and focus on UCC with tl/ucp as recommended.

Following up on your suggestion, I ran a set of OSU Micro-Benchmarks 7.4 tests to get more insight into OMPI+UCC+UCX performance. Here are the results:

  • P2P Bi-Directional Bandwidth Benchmark - Ran successfully with all data types, showing expected bandwidth limitations (logs here).
  • Collective AllGather Latency Benchmark - No issues; completed as expected, though with a noticeable impact on latency (logs here).
  • Collective AllReduce Latency Benchmark - With validation off, it completed without issues (logs here). With validation on, it failed for all data types, showing notable discrepancies between expected and actual values (logs here). Interestingly, in version 7.2 of the OSU benchmark suite (which uses the older validation approach), validation passed without issue (logs here).

The benchmark results show that collective communication across these GPUs is functional but does hit some performance and validation limitations. Given that, I'm curious whether the ucp_tag_send_nbx failure in PyTorch could stem from how PyTorch handles collectives. PyTorch's allreduce implementation is similar to what I'm using in my allreduce test, but maybe there's a key difference in configuration, or a system-stress factor, that I'm missing.

Based on these results, I'm wondering if you might have any insights into what could be causing the ucp_tag_send_nbx error specifically during collective operations with PyTorch, or which community would be the right place to ask for guidance on this issue. Given that basic send-receive communication works fine and that the custom allreduce runs successfully, it's puzzling why this failure arises with PyTorch collectives. Thank you again for your time and help with this!

@RafalSiwek
Author

The issue appears to be related to the UCX configuration: I had built UCX with multi-threading enabled, which this setup might not fully support. After rebuilding UCX with multi-threading disabled, everything seems to be working correctly.

Thank you very much for your support!
