
Test test_low_latency.py failed #15

Open
Nekofish-L opened this issue Feb 25, 2025 · 5 comments

@Nekofish-L

I encountered an issue while running the test with python test_low_latency.py. The test fails with the following error:

Allocating buffer size: 2116.2912 MB ...
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
corrupted double-linked list
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
W0225 20:24:08.312000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54153 via signal SIGTERM
W0225 20:24:08.313000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54154 via signal SIGTERM
W0225 20:24:08.313000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54157 via signal SIGTERM
W0225 20:24:08.314000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54158 via signal SIGTERM

However, another related test tests/test_intranode.py runs successfully and produces the expected output.

Environment:

  • TencentOS Server release 3.1 (CentOS)
  • 8 × H20 GPUs
  • CUDA 12.3
@haswelliris
Collaborator

Please provide additional information, including:

  • Network Card: You can obtain this information using the command ibstatus.
  • GPU Topology: Use the command nvidia-smi topo -m to retrieve this information.

Furthermore, please share the following report, if available:

  • nvshmem perftest report

Alternatively, you can simply provide the output of the command:

/path/to/nvshmem/dir/bin/nvshmem-info -a
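
If it helps, the same GPU details (including the per-device SM count, which turns out to matter below) can be collected from Python as well; a minimal sketch, assuming PyTorch is installed:

# Sketch: print GPU name, SM count, and memory for each device, then the topology matrix.
import subprocess
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.multi_processor_count} SMs, "
          f"{props.total_memory / 2**30:.1f} GiB")

# Same output as running `nvidia-smi topo -m` in a shell.
print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)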

@LyricZhao
Collaborator

The test contains the line num_tokens, hidden, num_topk, num_experts = 128, 7168, 8, 288, which means the kernel uses num_experts / 3 = 96 SMs (see https://github.com/deepseek-ai/DeepEP/blob/main/csrc/kernels/internode_ll.cu#L305C26-L305C63); this exceeds the H20's SM count.

You can try reducing the number of experts.
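
As a rough sanity check of the arithmetic above (a sketch only; the 78-SM figure for the H20 is an assumption, not something stated in this thread):

# Back-of-the-envelope check, not DeepEP code.
# Confirm the SM count with torch.cuda.get_device_properties(0).multi_processor_count.
num_tokens, hidden, num_topk, num_experts = 128, 7168, 8, 288  # values from test_low_latency.py
kNumWarpGroups = 3                               # constant used by the low-latency kernels
blocks_needed = num_experts // kNumWarpGroups    # 288 / 3 = 96 blocks in the cooperative launch
h20_sm_count = 78                                # assumed H20 SM count (an H100 SXM has 132)
if blocks_needed > h20_sm_count:
    print("cooperative launch will fail: too many blocks")  # matches the error in the log above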

@LyricZhao
Collaborator

LyricZhao commented Feb 25, 2025

We may add H20-compatible changes later.

@MengYu10151

> We may add H20-compatible changes later.

Hi Lyric, in addition to reducing the number of experts, is it possible to enable 288 experts by changing other parameters? I also wonder whether the H20 support change will be launched soon. Thanks!

@LyricZhao
Collaborator

> is it possible to enable 288 experts by changing other parameters

It is possible. For an easier fix (I will update the main branch later), you can simply change the following, in two places (both dispatch and combine):

constexpr int kNumWarpsPerGroup = 10;
constexpr int kNumWarpGroups = 3;

into

constexpr int kNumWarpsPerGroup = 8;
constexpr int kNumWarpGroups = 4;

and remove EP_HOST_ASSERT(ceil_div(static_cast<int>(hidden * 2 / sizeof(int4)), 32 * (num_warps - 1)) <= 2);.
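
For reference, a quick sketch of why the suggested constants should fit on the H20 (again assuming 78 SMs; warps per block is kNumWarpsPerGroup * kNumWarpGroups):

# Sketch only: arithmetic behind the suggested constant change.
num_experts = 288
h20_sm_count = 78                # assumption, not stated in this thread

# Original constants: kNumWarpsPerGroup = 10, kNumWarpGroups = 3
old_blocks = num_experts // 3    # 96 blocks -> exceeds the H20 SM count
old_warps = 10 * 3               # 30 warps (960 threads) per block

# Suggested constants: kNumWarpsPerGroup = 8, kNumWarpGroups = 4
new_blocks = num_experts // 4    # 72 blocks -> fits on 78 SMs
new_warps = 8 * 4                # 32 warps (1024 threads) per block, the CUDA per-block maximum

print(old_blocks, old_warps, new_blocks, new_warps)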

I don't have an H20 for testing; if the performance is not good, please report back and file an issue :)
