
Test test_low_latency.py failed #15

Open
Nekofish-L opened this issue Feb 25, 2025 · 5 comments

@Nekofish-L

I encountered an issue while running the test with python test_low_latency.py. The test fails with the following error:

Allocating buffer size: 2116.2912 MB ...
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
corrupted double-linked list
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
[/root/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch 
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
malloc(): unsorted double linked list corrupted
W0225 20:24:08.312000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54153 via signal SIGTERM
W0225 20:24:08.313000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54154 via signal SIGTERM
W0225 20:24:08.313000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54157 via signal SIGTERM
W0225 20:24:08.314000 54083 /data/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 54158 via signal SIGTERM

However, another related test tests/test_intranode.py runs successfully and produces the expected output.

Environment:

  • TencentOS Server release 3.1 (CentOS)
  • 8 × H20 GPUs
  • CUDA 12.3
@haswelliris
Collaborator

Please provide additional information, including:

  • Network Card: You can obtain this information using the command ibstatus.
  • GPU Topology: Use the command nvidia-smi topo -m to retrieve this information.

Furthermore, please share the following report, if available:

  • nvshmem perftest report

Alternatively, you can simply provide the output of the command:

/path/to/nvshmem/dir/bin/nvshmem-info -a
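
If it helps, the same GPU details (including the per-device SM count, which turns out to matter below) can be collected from Python as well; a minimal sketch, assuming PyTorch is installed:

# Sketch: print GPU name, SM count, and memory for each device, then the topology matrix.
import subprocess
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.multi_processor_count} SMs, "
          f"{props.total_memory / 2**30:.1f} GiB")

# Same output as running `nvidia-smi topo -m` in a shell.
print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)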

@LyricZhao
Collaborator

The test contains the line num_tokens, hidden, num_topk, num_experts = 128, 7168, 8, 288, which means the kernel uses num_experts / 3 = 96 SMs (see https://github.com/deepseek-ai/DeepEP/blob/main/csrc/kernels/internode_ll.cu#L305C26-L305C63); this exceeds the H20's SM count.

You can try reducing the number of experts.
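
As a rough sanity check of the arithmetic above (a sketch only; the 78-SM figure for the H20 is an assumption, not something stated in this thread):

# Back-of-the-envelope check, not DeepEP code.
# Confirm the SM count with torch.cuda.get_device_properties(0).multi_processor_count.
num_tokens, hidden, num_topk, num_experts = 128, 7168, 8, 288  # values from test_low_latency.py
kNumWarpGroups = 3                               # constant used by the low-latency kernels
blocks_needed = num_experts // kNumWarpGroups    # 288 / 3 = 96 blocks in the cooperative launch
h20_sm_count = 78                                # assumed H20 SM count (an H100 SXM has 132)
if blocks_needed > h20_sm_count:
    print("cooperative launch will fail: too many blocks")  # matches the error in the log above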

@LyricZhao
Collaborator

LyricZhao commented Feb 25, 2025

We may add H20-compatible changes later.

@MengYu10151

> We may add H20-compatible changes later.

Hi Lyric, in addition to reducing the number of experts, is it possible to enable 288 experts by changing other parameters? I also wonder whether the H20 support change will be launched soon. Thanks!

@LyricZhao
Collaborator

> is it possible to enable 288 experts by changing other parameters

It is possible. For an easier fix (I will update the main branch later), you can simply change the following, in two places (both dispatch and combine):

constexpr int kNumWarpsPerGroup = 10;
constexpr int kNumWarpGroups = 3;

into

constexpr int kNumWarpsPerGroup = 8;
constexpr int kNumWarpGroups = 4;

and remove EP_HOST_ASSERT(ceil_div(static_cast<int>(hidden * 2 / sizeof(int4)), 32 * (num_warps - 1)) <= 2);.
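
For reference, a quick sketch of why the suggested constants should fit on the H20 (again assuming 78 SMs; warps per block is kNumWarpsPerGroup * kNumWarpGroups):

# Sketch only: arithmetic behind the suggested constant change.
num_experts = 288
h20_sm_count = 78                # assumption, not stated in this thread

# Original constants: kNumWarpsPerGroup = 10, kNumWarpGroups = 3
old_blocks = num_experts // 3    # 96 blocks -> exceeds the H20 SM count
old_warps = 10 * 3               # 30 warps (960 threads) per block

# Suggested constants: kNumWarpsPerGroup = 8, kNumWarpGroups = 4
new_blocks = num_experts // 4    # 72 blocks -> fits on 78 SMs
new_warps = 8 * 4                # 32 warps (1024 threads) per block, the CUDA per-block maximum

print(old_blocks, old_warps, new_blocks, new_warps)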

I don't have an H20 for testing; if the performance is not good, please report back and file an issue :)
