Test fails on 2 nodes with tests/test_internode.py #13
Comments
Hello, can I ask how you compiled NVSHMEM? My gdrcopy_copybw run reports bandwidth numbers, but when I compile NVSHMEM it can't find the gdrcopy path. Is that path different from the one gdrcopy_copybw uses? |
1. Install gdrcopy
1.2 Create a soft link to the full path
1.3 Install the deb package
2. Install nvshmem_src
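For anyone hitting the same "can't find the path of gdrcopy" problem, here is a minimal sketch of a pre-build check. It assumes the NVSHMEM build is pointed at gdrcopy through the `GDRCOPY_HOME` environment variable and that the prefix follows the usual `include/gdrapi.h` and `lib/libgdrapi.so` layout. The helper name `missing_gdrcopy_files` and the default prefix `/usr/local/gdrcopy` are hypothetical, not taken from this thread. A deb install may place the files under system directories instead, which is presumably why the soft-link step is needed.

```python
import os

# Files a gdrcopy install is expected to provide under its prefix.
# Assumption: the conventional layout; a deb install may differ.
EXPECTED_FILES = ("include/gdrapi.h", "lib/libgdrapi.so")


def missing_gdrcopy_files(prefix):
    """Return the expected gdrcopy files that are absent under `prefix`."""
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.exists(os.path.join(prefix, rel))
    ]


if __name__ == "__main__":
    # Hypothetical default; replace with the prefix used in your install.
    prefix = os.environ.get("GDRCOPY_HOME", "/usr/local/gdrcopy")
    missing = missing_gdrcopy_files(prefix)
    if missing:
        print(f"{prefix} is missing: {', '.join(missing)}")
    else:
        print(f"{prefix} looks usable as GDRCOPY_HOME")
```

If the check reports missing files even though gdrcopy_copybw runs, the library is probably installed somewhere else, and either the soft link or `GDRCOPY_HOME` needs to point there.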
|
I think the script should be |
Yeah, there's a small glitch in the run script I pasted. |
It appears that you are experiencing network connectivity issues. Please provide additional information, including:
Additionally, include hardware details, such as:
Furthermore, please share the results of the following reports, if available:
|
@haswelliris
GPU topology
gdrcopy_copybw
nvshmem perftest report:
|
|
Hi @haswelliris, I have 8 Hopper GPUs and four 50 GB/s Mellanox CX7 NICs per node, and a 2-node NCCL allreduce reaches 193 GB/s busbw. I followed the DeepEP guide and ran test_internode.py, but the internode performance is extremely low (while intranode is normal). Do you have any ideas? [tuning] Best dispatch (FP8): SMs 24, NVL chunk 12, RDMA chunk 12: 5.36 GB/s (RDMA), 17.49 GB/s (NVL) |
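To put the reported figure in context, here is a rough sketch that parses the `[tuning]` line and compares the RDMA number with each GPU's share of NIC bandwidth. It assumes the eight GPUs share the four NICs evenly and that the reported RDMA bandwidth is a per-GPU figure. Both are assumptions, and the function names are hypothetical. Under those assumptions each GPU's share is 25 GB/s, so 5.36 GB/s is about 21% of it.

```python
import re

# Matches the numeric fields of a "[tuning] Best dispatch ..." line.
TUNING_RE = re.compile(
    r"SMs (\d+), NVL chunk (\d+), RDMA chunk (\d+): "
    r"([\d.]+) GB/s \(RDMA\), ([\d.]+) GB/s \(NVL\)"
)


def parse_tuning_line(line):
    """Extract the SM count, chunk sizes and bandwidths from a tuning line."""
    m = TUNING_RE.search(line)
    if m is None:
        raise ValueError("not a recognized tuning line")
    sms, nvl_chunk, rdma_chunk = (int(g) for g in m.group(1, 2, 3))
    return {
        "sms": sms,
        "nvl_chunk": nvl_chunk,
        "rdma_chunk": rdma_chunk,
        "rdma_gb_per_s": float(m.group(4)),
        "nvl_gb_per_s": float(m.group(5)),
    }


def per_gpu_nic_share(num_nics, nic_gb_per_s, num_gpus):
    """NIC bandwidth per GPU, assuming GPUs share the NICs evenly."""
    return num_nics * nic_gb_per_s / num_gpus


if __name__ == "__main__":
    line = ("[tuning] Best dispatch (FP8): SMs 24, NVL chunk 12, "
            "RDMA chunk 12: 5.36 GB/s (RDMA), 17.49 GB/s (NVL)")
    result = parse_tuning_line(line)
    share = per_gpu_nic_share(num_nics=4, nic_gb_per_s=50.0, num_gpus=8)
    fraction = result["rdma_gb_per_s"] / share
    print(f"RDMA {result['rdma_gb_per_s']} GB/s is {fraction:.1%} "
          f"of the {share} GB/s per-GPU NIC share")
```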
script:
generate log:
This error is in csrc/deep_ep.cpp L765. It is caught and thrown after a timeout. How can I analyze the specific problem?