Failing PyTorch Collective Operations on Non-Homogeneous Ring (NVIDIA and AMD GPUs) #1045
Comments
@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collective without UCC can work (at least not for reductions): you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration. Regarding the …
Thanks @edgargabriel for the feedback and for clarifying the MPI collective aspect. I'll keep that in mind and focus on UCC with `tl/ucp`.

Following up on your suggestion, I ran a set of OSU Micro-Benchmarks 7.4 to get more insight into the OMPI+UCC+UCX performance. Here are the results:
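(For reference, this is roughly how I pin the runs to the UCC collective component and the `tl/ucp` transport layer from the Python side. Treat it as a sketch of my launch assumptions rather than the exact scripts in the repo; the same settings can equally be passed to `mpirun` as `--mca` flags or exported in the environment.)

```python
import os

import torch.distributed as dist

# These must be visible before MPI/UCC initialize, which happens inside
# init_process_group("mpi") below (alternatively export them via mpirun -x
# or pass the MCA parameters with --mca on the command line).
os.environ.setdefault("OMPI_MCA_coll_ucc_enable", "1")      # turn on Open MPI's UCC collective component
os.environ.setdefault("OMPI_MCA_coll_ucc_priority", "100")  # prefer it over the other coll components
os.environ.setdefault("UCC_TLS", "ucp")                     # restrict UCC to the tl/ucp transport layer

dist.init_process_group(backend="mpi")  # rank and world size are taken from the MPI launcher
```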
The benchmark results show that collective communication across these GPUs is functional but does hit some performance and validation limitations. Given that, I'm curious if the …

Based on these results, I'm wondering if you might have any insights into what could be causing the `ucp_tag_send_nbx` error specifically during collective operations with PyTorch, or which community would be the right place to ask for guidance in understanding this issue. Given that basic send-receive communication works fine and that the custom allreduce runs successfully, it's puzzling why this failure arises with PyTorch collectives. Thank you again for your time and help with this!
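(For context, the failing call is essentially of this shape. This is a minimal sketch rather than the exact test code linked above, and the single-GPU-per-rank device selection is my assumption.)

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # Open MPI built with UCC + UCX
rank = dist.get_rank()

# One GPU per rank: the T4 on the CUDA node, the V520 on the ROCm node
# (ROCm builds of PyTorch expose HIP devices through the "cuda" device type).
device = torch.device("cuda", 0)
tensor = torch.ones(4, device=device) * (rank + 1)

# Point-to-point send_recv with the same tensors succeeds, but this collective
# is the call that aborts inside ucp_tag_send_nbx on this setup.
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {tensor}")

dist.destroy_process_group()
```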
The issue appears to be related to the UCX configuration. I had enabled multi-threading support, which this setup might not fully support. After rebuilding UCX with multi-threading disabled, everything seems to be working correctly. Thank you very much for your support!
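(In case it helps anyone else who lands here: one quick way I know of to check whether an installed UCX build has multi-threading support compiled in is to look at the configure flags reported by `ucx_info -v`. The exact `--enable-mt` string is an assumption about how the build was configured.)

```python
import subprocess

# ucx_info -v prints the library version together with the flags it was configured with.
build_info = subprocess.run(
    ["ucx_info", "-v"], capture_output=True, text=True, check=True
).stdout
print(build_info)
print("multi-threading enabled:", "--enable-mt" in build_info)
```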
Hi UCC Team,
Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: `g4ad.xlarge` (AMD ROCm) and `g4dn.xlarge` (NVIDIA CUDA) instances.

(All relevant code, log outputs, and observations providing additional context can be found here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)
Summary of Setup
Software and Hardware Configuration
- `g4ad.xlarge` instance with AMD Radeon Pro V520 (RDNA1 architecture, `gfx1011` shader) and ROCm 6.2.2.
- `g4dn.xlarge` instance with NVIDIA T4 GPU (Turing architecture) and CUDA 12.4.

Observed Behavior
- Bi-Directional Communication (test code here): Running a basic bidirectional `send_recv` test in PyTorch was successful, confirming basic communication across the GPUs (logs available here).
- Allreduce Operation (test code here):
  - The `allreduce` operation fails consistently on the `ucp_tag_send_nbx` function for both ranks in PyTorch (logs and backtrace available here).
  - The `allreduce` completes successfully on the CUDA rank but fails on the ROCm rank (logs and backtrace available here).

Request for Technical Insights
To better understand and resolve the failure in `ucp_tag_send_nbx` during the `allreduce` operation, I would appreciate guidance on the following:

- Potential Causes of `ucp_tag_send_nbx` Failures: Could you provide technical insights into why the `ucp_tag_send_nbx` operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of the specific communication or memory operations that might impact `ucp_tag_send_nbx` in heterogeneous setups would be helpful.
- Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behavior in this setup, I would be glad to run further tests (see the sketch below).
Thank you for your continued support and assistance with this project!