Segmentation Error when running SimCCL #146

@linsyking

Description

Steps

1.1 In the SimCCL folder, I first had to run chmod +x src/device/generate.py; otherwise the build fails with a "permission denied" error.
1.2 I ran make -j src.build, which succeeded.
1.3 I ran cd test, then make.
1.4 I ran bash run.sh.

==> Error (segmentation fault; see the log below).
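For anyone trying to reproduce this, the steps above can be collected into one script. This is only a sketch of what I ran, not something shipped with SimCCL: the function names `run` and `repro_simccl`, the `DRY_RUN` variable, and the directory argument are made up for this example, and `make -C <dir>` is used in place of changing into the directory before calling `make`.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps from this report.
# `run`, `repro_simccl`, and DRY_RUN are hypothetical names for this example.

run() {
  # Print the command; execute it only when DRY_RUN is not set to 1.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

repro_simccl() {
  # $1: path to the SimCCL checkout (defaults to the current directory).
  local dir="${1:-.}"

  # Step 1.1: the generator script must be executable, or the build
  # stops with "permission denied".
  run chmod +x "$dir/src/device/generate.py" || return 1

  # Step 1.2: build the library.
  run make -C "$dir" -j src.build || return 1

  # Step 1.3: build the tests.
  run make -C "$dir/test" || return 1

  # Step 1.4: run the test script from inside the test directory.
  # The subshell keeps the caller's working directory unchanged.
  ( run cd "$dir/test" && run bash run.sh )
}
```

Sourcing the file and calling `DRY_RUN=1 repro_simccl /path/to/SimCCL` only prints the commands; without `DRY_RUN=1` it executes them and stops at the first step that fails.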

Log

NCCL version 2.20.5-MockNCCL+cuda12.6

inspur-gpu-server-15:2563676:2563678 [0] misc/cudawrap.cc:34 NCCL WARN Cuda failure 'no CUDA-capable device is detected'

inspur-gpu-server-15:2563676:2563678 [0] init.cc:270 NCCL WARN Cuda failure 3 'initialization error'
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Mocking start: mock_nNodes(8), mock_nRanks(64), mock_nRanksPerNode(8).
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO info_index(0): 0x7f9f08096a30, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 2935421449793392623, pidHash: 12347059415262710392, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm_index(0): 0x55c7f6f30c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 0, ncclCollNet: 0.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO nNodes: 64.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm_index(0): 0x55c7f6f30c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 1, ncclCollNet: 0.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[4].id: 0000:89:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[5].id: 0000:8e:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[6].id: 0000:ad:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[7].id: 0000:b3:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[4].id: 0000:89:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[5].id: 0000:8e:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[6].id: 0000:ad:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO (*system)->nodes[GPU].nodes[7].id: 0000:b3:00.0, gdr = 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO After ncclTopoSearchInit, system_info{nodes[NET].count: 5, nodes[GPU].count: 8, maxBW: 24.000000, totalBW: 240.000000}.
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/13000 :GPU/13000 (0/5000.000000/LOC) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/19000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (0/5000.000000/LOC) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/48000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (0/5000.000000/LOC) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/4D000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (0/5000.000000/LOC) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB) NET/3 (6/24.000000/PXN) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/89000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (4/24.000000/PXB) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/8E000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (0/5000.000000/LOC) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (4/24.000000/PXB) NET/4 (6/24.000000/PXN)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/AD000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (0/5000.000000/LOC) GPU/B3000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (4/24.000000/PXB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO GPU/B3000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) GPU/89000 (2/240.000000/NVL) GPU/8E000 (2/240.000000/NVL) GPU/AD000 (2/240.000000/NVL) GPU/B3000 (0/5000.000000/LOC) NVS/0 (1/240.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (6/24.000000/PXN) NET/2 (6/24.000000/PXN) NET/3 (6/24.000000/PXN) NET/4 (4/24.000000/PXB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/0 :GPU/13000 (5/24.000000/PHB) GPU/19000 (5/24.000000/PHB) GPU/48000 (5/24.000000/PHB) GPU/4D000 (5/24.000000/PHB) GPU/89000 (6/10.000000/SYS) GPU/8E000 (6/10.000000/SYS) GPU/AD000 (6/10.000000/SYS) GPU/B3000 (6/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/1 :GPU/13000 (4/24.000000/PXB) GPU/19000 (4/24.000000/PXB) GPU/48000 (6/24.000000/PHB) GPU/4D000 (6/24.000000/PHB) GPU/89000 (7/10.000000/SYS) GPU/8E000 (7/10.000000/SYS) GPU/AD000 (7/10.000000/SYS) GPU/B3000 (7/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (6/24.000000/PHB) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/2 :GPU/13000 (6/24.000000/PHB) GPU/19000 (6/24.000000/PHB) GPU/48000 (4/24.000000/PXB) GPU/4D000 (4/24.000000/PXB) GPU/89000 (7/10.000000/SYS) GPU/8E000 (7/10.000000/SYS) GPU/AD000 (7/10.000000/SYS) GPU/B3000 (7/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (0/5000.000000/LOC) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/3 :GPU/13000 (7/10.000000/SYS) GPU/19000 (7/10.000000/SYS) GPU/48000 (7/10.000000/SYS) GPU/4D000 (7/10.000000/SYS) GPU/89000 (4/24.000000/PXB) GPU/8E000 (4/24.000000/PXB) GPU/AD000 (6/24.000000/PHB) GPU/B3000 (6/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (0/5000.000000/LOC) NET/4 (6/24.000000/PHB)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO NET/4 :GPU/13000 (7/10.000000/SYS) GPU/19000 (7/10.000000/SYS) GPU/48000 (7/10.000000/SYS) GPU/4D000 (7/10.000000/SYS) GPU/89000 (6/24.000000/PHB) GPU/8E000 (6/24.000000/PHB) GPU/AD000 (4/24.000000/PXB) GPU/B3000 (4/24.000000/PXB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (6/24.000000/PHB) NET/4 (0/5000.000000/LOC)
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[0].gpu.rank: 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[1].gpu.rank: 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[2].gpu.rank: 2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[3].gpu.rank: 3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[4].gpu.rank: 4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[5].gpu.rank: 5
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[6].gpu.rank: 6
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO s->nodes[GPU].nodes[7].gpu.rank: 7
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->compCap: 80, comm->minCompCap: 80, comm->maxCompCap: 80
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO mock_comm[0].compCap: 80, mock_comm[0].minCompCap: 80, mock_comm[0].maxCompCap: 80
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/1 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/4 GPU/3 GPU/2 GPU/1 GPU/0 GPU/7 GPU/5 GPU/6 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/6 GPU/5 GPU/2 GPU/1 GPU/0 GPU/7 GPU/3 GPU/4 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 0]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/1 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/4 GPU/3 GPU/2 GPU/1 GPU/0 GPU/7 GPU/5 GPU/6 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/6 GPU/5 GPU/2 GPU/1 GPU/0 GPU/7 GPU/3 GPU/4 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 0]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 1]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/8 GPU/15 GPU/14 GPU/13 GPU/12 GPU/11 GPU/9 GPU/10 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/10 GPU/9 GPU/15 GPU/14 GPU/13 GPU/12 GPU/11 GPU/8 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/12 GPU/11 GPU/10 GPU/9 GPU/8 GPU/15 GPU/13 GPU/14 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/14 GPU/13 GPU/10 GPU/9 GPU/8 GPU/15 GPU/11 GPU/12 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 1]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/8 GPU/9 GPU/10 GPU/11 GPU/12 GPU/13 GPU/14 GPU/15 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/10 GPU/11 GPU/12 GPU/13 GPU/14 GPU/15 GPU/8 GPU/9 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/12 GPU/13 GPU/14 GPU/15 GPU/8 GPU/9 GPU/10 GPU/11 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/14 GPU/15 GPU/8 GPU/9 GPU/10 GPU/11 GPU/12 GPU/13 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 2]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/16 GPU/23 GPU/22 GPU/21 GPU/20 GPU/19 GPU/17 GPU/18 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/18 GPU/17 GPU/23 GPU/22 GPU/21 GPU/20 GPU/19 GPU/16 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/20 GPU/19 GPU/18 GPU/17 GPU/16 GPU/23 GPU/21 GPU/22 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/22 GPU/21 GPU/18 GPU/17 GPU/16 GPU/23 GPU/19 GPU/20 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 2]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/16 GPU/17 GPU/18 GPU/19 GPU/20 GPU/21 GPU/22 GPU/23 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/18 GPU/19 GPU/20 GPU/21 GPU/22 GPU/23 GPU/16 GPU/17 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/20 GPU/21 GPU/22 GPU/23 GPU/16 GPU/17 GPU/18 GPU/19 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/22 GPU/23 GPU/16 GPU/17 GPU/18 GPU/19 GPU/20 GPU/21 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 3]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/24 GPU/31 GPU/30 GPU/29 GPU/28 GPU/27 GPU/25 GPU/26 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/26 GPU/25 GPU/31 GPU/30 GPU/29 GPU/28 GPU/27 GPU/24 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/28 GPU/27 GPU/26 GPU/25 GPU/24 GPU/31 GPU/29 GPU/30 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/30 GPU/29 GPU/26 GPU/25 GPU/24 GPU/31 GPU/27 GPU/28 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 3]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/24 GPU/25 GPU/26 GPU/27 GPU/28 GPU/29 GPU/30 GPU/31 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/26 GPU/27 GPU/28 GPU/29 GPU/30 GPU/31 GPU/24 GPU/25 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/28 GPU/29 GPU/30 GPU/31 GPU/24 GPU/25 GPU/26 GPU/27 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/30 GPU/31 GPU/24 GPU/25 GPU/26 GPU/27 GPU/28 GPU/29 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 4]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/32 GPU/39 GPU/38 GPU/37 GPU/36 GPU/35 GPU/33 GPU/34 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/34 GPU/33 GPU/39 GPU/38 GPU/37 GPU/36 GPU/35 GPU/32 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/36 GPU/35 GPU/34 GPU/33 GPU/32 GPU/39 GPU/37 GPU/38 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/38 GPU/37 GPU/34 GPU/33 GPU/32 GPU/39 GPU/35 GPU/36 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 4]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/32 GPU/33 GPU/34 GPU/35 GPU/36 GPU/37 GPU/38 GPU/39 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/34 GPU/35 GPU/36 GPU/37 GPU/38 GPU/39 GPU/32 GPU/33 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/36 GPU/37 GPU/38 GPU/39 GPU/32 GPU/33 GPU/34 GPU/35 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/38 GPU/39 GPU/32 GPU/33 GPU/34 GPU/35 GPU/36 GPU/37 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 5]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/40 GPU/47 GPU/46 GPU/45 GPU/44 GPU/43 GPU/41 GPU/42 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/42 GPU/41 GPU/47 GPU/46 GPU/45 GPU/44 GPU/43 GPU/40 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/44 GPU/43 GPU/42 GPU/41 GPU/40 GPU/47 GPU/45 GPU/46 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/46 GPU/45 GPU/42 GPU/41 GPU/40 GPU/47 GPU/43 GPU/44 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 5]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/40 GPU/41 GPU/42 GPU/43 GPU/44 GPU/45 GPU/46 GPU/47 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/42 GPU/43 GPU/44 GPU/45 GPU/46 GPU/47 GPU/40 GPU/41 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/44 GPU/45 GPU/46 GPU/47 GPU/40 GPU/41 GPU/42 GPU/43 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/46 GPU/47 GPU/40 GPU/41 GPU/42 GPU/43 GPU/44 GPU/45 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 6]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/48 GPU/55 GPU/54 GPU/53 GPU/52 GPU/51 GPU/49 GPU/50 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/50 GPU/49 GPU/55 GPU/54 GPU/53 GPU/52 GPU/51 GPU/48 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/52 GPU/51 GPU/50 GPU/49 GPU/48 GPU/55 GPU/53 GPU/54 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/54 GPU/53 GPU/50 GPU/49 GPU/48 GPU/55 GPU/51 GPU/52 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 6]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/48 GPU/49 GPU/50 GPU/51 GPU/52 GPU/53 GPU/54 GPU/55 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/50 GPU/51 GPU/52 GPU/53 GPU/54 GPU/55 GPU/48 GPU/49 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/52 GPU/53 GPU/54 GPU/55 GPU/48 GPU/49 GPU/50 GPU/51 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/54 GPU/55 GPU/48 GPU/49 GPU/50 GPU/51 GPU/52 GPU/53 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock ring Graph[node: 7]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 4, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/56 GPU/63 GPU/62 GPU/61 GPU/60 GPU/59 GPU/57 GPU/58 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/58 GPU/57 GPU/63 GPU/62 GPU/61 GPU/60 GPU/59 GPU/56 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/60 GPU/59 GPU/58 GPU/57 GPU/56 GPU/63 GPU/61 GPU/62 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/62 GPU/61 GPU/58 GPU/57 GPU/56 GPU/63 GPU/59 GPU/60 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO ------ Mock tree Graph[node: 7]: ------
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 4, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  0 : NET/1 GPU/56 GPU/57 GPU/58 GPU/59 GPU/60 GPU/61 GPU/62 GPU/63 NET/2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  1 : NET/2 GPU/58 GPU/59 GPU/60 GPU/61 GPU/62 GPU/63 GPU/56 GPU/57 NET/1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  2 : NET/3 GPU/60 GPU/61 GPU/62 GPU/63 GPU/56 GPU/57 GPU/58 GPU/59 NET/4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO  3 : NET/4 GPU/62 GPU/63 GPU/56 GPU/57 GPU/58 GPU/59 GPU/60 GPU/61 NET/3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->nNodes: 8
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO treeGraph.nChannels: 4, ringGraph.nChannels: 4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 00/08 :     0    7    6    5    4    3    1    2   10    9   15   14   13   12   11    8   16   23   22   21   20   19   17   18   26   25   31   30   29   28   27   24   32   39   38   37   36   35   33   34   42   41   47   46   45   44   43   40   48   55   54   53   52   51   49   50   58   57   63   62   61   60   59   56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 01/08 :     0    8   15   14   13   12   11    9   10   18   17   23   22   21   20   19   16   24   31   30   29   28   27   25   26   34   33   39   38   37   36   35   32   40   47   46   45   44   43   41   42   50   49   55   54   53   52   51   48   56   63   62   61   60   59   57   58    2    1    7    6    5    4    3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 02/08 :     0    7    5    6   14   13   10    9    8   15   11   12   20   19   18   17   16   23   21   22   30   29   26   25   24   31   27   28   36   35   34   33   32   39   37   38   46   45   42   41   40   47   43   44   52   51   50   49   48   55   53   54   62   61   58   57   56   63   59   60    4    3    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 03/08 :     0    7    3    4   12   11   10    9    8   15   13   14   22   21   18   17   16   23   19   20   28   27   26   25   24   31   29   30   38   37   34   33   32   39   35   36   44   43   42   41   40   47   45   46   54   53   50   49   48   55   51   52   60   59   58   57   56   63   61   62    6    5    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 04/08 :     0    7    6    5    4    3    1    2   10    9   15   14   13   12   11    8   16   23   22   21   20   19   17   18   26   25   31   30   29   28   27   24   32   39   38   37   36   35   33   34   42   41   47   46   45   44   43   40   48   55   54   53   52   51   49   50   58   57   63   62   61   60   59   56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 05/08 :     0    8   15   14   13   12   11    9   10   18   17   23   22   21   20   19   16   24   31   30   29   28   27   25   26   34   33   39   38   37   36   35   32   40   47   46   45   44   43   41   42   50   49   55   54   53   52   51   48   56   63   62   61   60   59   57   58    2    1    7    6    5    4    3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 06/08 :     0    7    5    6   14   13   10    9    8   15   11   12   20   19   18   17   16   23   21   22   30   29   26   25   24   31   27   28   36   35   34   33   32   39   37   38   46   45   42   41   40   47   43   44   52   51   50   49   48   55   53   54   62   61   58   57   56   63   59   60    4    3    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 07/08 :     0    7    3    4   12   11   10    9    8   15   13   14   22   21   18   17   16   23   19   20   28   27   26   25   24   31   29   30   38   37   34   33   32   39   35   36   44   43   42   41   40   47   45   46   54   53   50   49   48   55   51   52   60   59   58   57   56   63   61   62    6    5    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO mock_comm->config.maxCTAs: 32, mock_comm->config.minCTAs: 1, mock_comm->sharedRes->tpNChannels: 0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 00/08 :     0    7    6    5    4    3    1    2   10    9   15   14   13   12   11    8   16   23   22   21   20   19   17   18   26   25   31   30   29   28   27   24   32   39   38   37   36   35   33   34   42   41   47   46   45   44   43   40   48   55   54   53   52   51   49   50   58   57   63   62   61   60   59   56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 01/08 :     0    8   15   14   13   12   11    9   10   18   17   23   22   21   20   19   16   24   31   30   29   28   27   25   26   34   33   39   38   37   36   35   32   40   47   46   45   44   43   41   42   50   49   55   54   53   52   51   48   56   63   62   61   60   59   57   58    2    1    7    6    5    4    3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 02/08 :     0    7    5    6   14   13   10    9    8   15   11   12   20   19   18   17   16   23   21   22   30   29   26   25   24   31   27   28   36   35   34   33   32   39   37   38   46   45   42   41   40   47   43   44   52   51   50   49   48   55   53   54   62   61   58   57   56   63   59   60    4    3    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 03/08 :     0    7    3    4   12   11   10    9    8   15   13   14   22   21   18   17   16   23   19   20   28   27   26   25   24   31   29   30   38   37   34   33   32   39   35   36   44   43   42   41   40   47   45   46   54   53   50   49   48   55   51   52   60   59   58   57   56   63   61   62    6    5    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 04/08 :     0    7    6    5    4    3    1    2   10    9   15   14   13   12   11    8   16   23   22   21   20   19   17   18   26   25   31   30   29   28   27   24   32   39   38   37   36   35   33   34   42   41   47   46   45   44   43   40   48   55   54   53   52   51   49   50   58   57   63   62   61   60   59   56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 05/08 :     0    8   15   14   13   12   11    9   10   18   17   23   22   21   20   19   16   24   31   30   29   28   27   25   26   34   33   39   38   37   36   35   32   40   47   46   45   44   43   41   42   50   49   55   54   53   52   51   48   56   63   62   61   60   59   57   58    2    1    7    6    5    4    3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 06/08 :     0    7    5    6   14   13   10    9    8   15   11   12   20   19   18   17   16   23   21   22   30   29   26   25   24   31   27   28   36   35   34   33   32   39   37   38   46   45   42   41   40   47   43   44   52   51   50   49   48   55   53   54   62   61   58   57   56   63   59   60    4    3    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Channel 07/08 :     0    7    3    4   12   11   10    9    8   15   13   14   22   21   18   17   16   23   19   20   28   27   26   25   24   31   29   30   38   37   34   33   32   39   35   36   44   43   42   41   40   47   45   46   54   53   50   49   48   55   51   52   60   59   58   57   56   63   61   62    6    5    2    1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_0 Trees [0] 1/32/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_1 Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_2 Trees [0] 3/-1/-1->2->1 [1] 3/34/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_3 Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 4/-1/-1->3->2
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_4 Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/36/-1->4->-1 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->12 [7] 5/-1/-1->4->3
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_5 Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] -1/-1/-1->5->4
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_6 Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/38/-1->6->-1 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->14
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_7 Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/-1/-1->7->6
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_8 Trees [0] 9/-1/-1->8->17 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->24 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_9 Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/16/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_10 Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->19 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->26 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_11 Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] -1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/18/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_12 Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->21 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->28 [7] 13/-1/-1->12->11
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_13 Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/20/-1->13->12 [7] -1/-1/-1->13->12
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_14 Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->23 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->30
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_15 Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/22/-1->15->14
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_16 Trees [0] 17/24/-1->16->33 [1] 17/-1/-1->16->23 [2] 17/-1/-1->16->23 [3] 17/-1/-1->16->23 [4] 17/-1/-1->16->9 [5] 17/-1/-1->16->23 [6] 17/-1/-1->16->23 [7] 17/-1/-1->16->23
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_17 Trees [0] 18/8/-1->17->16 [1] -1/-1/-1->17->16 [2] 18/-1/-1->17->16 [3] 18/-1/-1->17->16 [4] 18/-1/-1->17->16 [5] -1/-1/-1->17->16 [6] 18/-1/-1->17->16 [7] 18/-1/-1->17->16
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_18 Trees [0] 19/-1/-1->18->17 [1] 19/26/-1->18->35 [2] 19/-1/-1->18->17 [3] 19/-1/-1->18->17 [4] 19/-1/-1->18->17 [5] 19/-1/-1->18->11 [6] 19/-1/-1->18->17 [7] 19/-1/-1->18->17
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_19 Trees [0] 20/-1/-1->19->18 [1] 20/10/-1->19->18 [2] -1/-1/-1->19->18 [3] 20/-1/-1->19->18 [4] 20/-1/-1->19->18 [5] 20/-1/-1->19->18 [6] -1/-1/-1->19->18 [7] 20/-1/-1->19->18
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_20 Trees [0] 21/-1/-1->20->19 [1] 21/-1/-1->20->19 [2] 21/28/-1->20->37 [3] 21/-1/-1->20->19 [4] 21/-1/-1->20->19 [5] 21/-1/-1->20->19 [6] 21/-1/-1->20->13 [7] 21/-1/-1->20->19
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_21 Trees [0] 22/-1/-1->21->20 [1] 22/-1/-1->21->20 [2] 22/12/-1->21->20 [3] -1/-1/-1->21->20 [4] 22/-1/-1->21->20 [5] 22/-1/-1->21->20 [6] 22/-1/-1->21->20 [7] -1/-1/-1->21->20
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_22 Trees [0] 23/-1/-1->22->21 [1] 23/-1/-1->22->21 [2] 23/-1/-1->22->21 [3] 23/30/-1->22->39 [4] 23/-1/-1->22->21 [5] 23/-1/-1->22->21 [6] 23/-1/-1->22->21 [7] 23/-1/-1->22->15
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_23 Trees [0] -1/-1/-1->23->22 [1] 16/-1/-1->23->22 [2] 16/-1/-1->23->22 [3] 16/14/-1->23->22 [4] -1/-1/-1->23->22 [5] 16/-1/-1->23->22 [6] 16/-1/-1->23->22 [7] 16/-1/-1->23->22
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_24 Trees [0] 25/-1/-1->24->16 [1] 25/-1/-1->24->31 [2] 25/-1/-1->24->31 [3] 25/-1/-1->24->31 [4] 25/8/-1->24->56 [5] 25/-1/-1->24->31 [6] 25/-1/-1->24->31 [7] 25/-1/-1->24->31
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_25 Trees [0] 26/-1/-1->25->24 [1] -1/-1/-1->25->24 [2] 26/-1/-1->25->24 [3] 26/-1/-1->25->24 [4] 26/40/-1->25->24 [5] -1/-1/-1->25->24 [6] 26/-1/-1->25->24 [7] 26/-1/-1->25->24
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_26 Trees [0] 27/-1/-1->26->25 [1] 27/-1/-1->26->18 [2] 27/-1/-1->26->25 [3] 27/-1/-1->26->25 [4] 27/-1/-1->26->25 [5] 27/10/-1->26->58 [6] 27/-1/-1->26->25 [7] 27/-1/-1->26->25
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_27 Trees [0] 28/-1/-1->27->26 [1] 28/-1/-1->27->26 [2] -1/-1/-1->27->26 [3] 28/-1/-1->27->26 [4] 28/-1/-1->27->26 [5] 28/42/-1->27->26 [6] -1/-1/-1->27->26 [7] 28/-1/-1->27->26
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_28 Trees [0] 29/-1/-1->28->27 [1] 29/-1/-1->28->27 [2] 29/-1/-1->28->20 [3] 29/-1/-1->28->27 [4] 29/-1/-1->28->27 [5] 29/-1/-1->28->27 [6] 29/12/-1->28->60 [7] 29/-1/-1->28->27
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_29 Trees [0] 30/-1/-1->29->28 [1] 30/-1/-1->29->28 [2] 30/-1/-1->29->28 [3] -1/-1/-1->29->28 [4] 30/-1/-1->29->28 [5] 30/-1/-1->29->28 [6] 30/44/-1->29->28 [7] -1/-1/-1->29->28
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_30 Trees [0] 31/-1/-1->30->29 [1] 31/-1/-1->30->29 [2] 31/-1/-1->30->29 [3] 31/-1/-1->30->22 [4] 31/-1/-1->30->29 [5] 31/-1/-1->30->29 [6] 31/-1/-1->30->29 [7] 31/14/-1->30->62
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_31 Trees [0] -1/-1/-1->31->30 [1] 24/-1/-1->31->30 [2] 24/-1/-1->31->30 [3] 24/-1/-1->31->30 [4] -1/-1/-1->31->30 [5] 24/-1/-1->31->30 [6] 24/-1/-1->31->30 [7] 24/46/-1->31->30
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_32 Trees [0] 33/48/-1->32->0 [1] 33/-1/-1->32->39 [2] 33/-1/-1->32->39 [3] 33/-1/-1->32->39 [4] 33/-1/-1->32->40 [5] 33/-1/-1->32->39 [6] 33/-1/-1->32->39 [7] 33/-1/-1->32->39
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_33 Trees [0] 34/16/-1->33->32 [1] -1/-1/-1->33->32 [2] 34/-1/-1->33->32 [3] 34/-1/-1->33->32 [4] 34/-1/-1->33->32 [5] -1/-1/-1->33->32 [6] 34/-1/-1->33->32 [7] 34/-1/-1->33->32
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_34 Trees [0] 35/-1/-1->34->33 [1] 35/50/-1->34->2 [2] 35/-1/-1->34->33 [3] 35/-1/-1->34->33 [4] 35/-1/-1->34->33 [5] 35/-1/-1->34->42 [6] 35/-1/-1->34->33 [7] 35/-1/-1->34->33
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_35 Trees [0] 36/-1/-1->35->34 [1] 36/18/-1->35->34 [2] -1/-1/-1->35->34 [3] 36/-1/-1->35->34 [4] 36/-1/-1->35->34 [5] 36/-1/-1->35->34 [6] -1/-1/-1->35->34 [7] 36/-1/-1->35->34
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_36 Trees [0] 37/-1/-1->36->35 [1] 37/-1/-1->36->35 [2] 37/52/-1->36->4 [3] 37/-1/-1->36->35 [4] 37/-1/-1->36->35 [5] 37/-1/-1->36->35 [6] 37/-1/-1->36->44 [7] 37/-1/-1->36->35
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_37 Trees [0] 38/-1/-1->37->36 [1] 38/-1/-1->37->36 [2] 38/20/-1->37->36 [3] -1/-1/-1->37->36 [4] 38/-1/-1->37->36 [5] 38/-1/-1->37->36 [6] 38/-1/-1->37->36 [7] -1/-1/-1->37->36
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_38 Trees [0] 39/-1/-1->38->37 [1] 39/-1/-1->38->37 [2] 39/-1/-1->38->37 [3] 39/54/-1->38->6 [4] 39/-1/-1->38->37 [5] 39/-1/-1->38->37 [6] 39/-1/-1->38->37 [7] 39/-1/-1->38->46
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_39 Trees [0] -1/-1/-1->39->38 [1] 32/-1/-1->39->38 [2] 32/-1/-1->39->38 [3] 32/22/-1->39->38 [4] -1/-1/-1->39->38 [5] 32/-1/-1->39->38 [6] 32/-1/-1->39->38 [7] 32/-1/-1->39->38
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_40 Trees [0] 41/-1/-1->40->49 [1] 41/-1/-1->40->47 [2] 41/-1/-1->40->47 [3] 41/-1/-1->40->47 [4] 41/32/-1->40->25 [5] 41/-1/-1->40->47 [6] 41/-1/-1->40->47 [7] 41/-1/-1->40->47
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_41 Trees [0] 42/-1/-1->41->40 [1] -1/-1/-1->41->40 [2] 42/-1/-1->41->40 [3] 42/-1/-1->41->40 [4] 42/48/-1->41->40 [5] -1/-1/-1->41->40 [6] 42/-1/-1->41->40 [7] 42/-1/-1->41->40
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_42 Trees [0] 43/-1/-1->42->41 [1] 43/-1/-1->42->51 [2] 43/-1/-1->42->41 [3] 43/-1/-1->42->41 [4] 43/-1/-1->42->41 [5] 43/34/-1->42->27 [6] 43/-1/-1->42->41 [7] 43/-1/-1->42->41
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_43 Trees [0] 44/-1/-1->43->42 [1] 44/-1/-1->43->42 [2] -1/-1/-1->43->42 [3] 44/-1/-1->43->42 [4] 44/-1/-1->43->42 [5] 44/50/-1->43->42 [6] -1/-1/-1->43->42 [7] 44/-1/-1->43->42
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_44 Trees [0] 45/-1/-1->44->43 [1] 45/-1/-1->44->43 [2] 45/-1/-1->44->53 [3] 45/-1/-1->44->43 [4] 45/-1/-1->44->43 [5] 45/-1/-1->44->43 [6] 45/36/-1->44->29 [7] 45/-1/-1->44->43
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_45 Trees [0] 46/-1/-1->45->44 [1] 46/-1/-1->45->44 [2] 46/-1/-1->45->44 [3] -1/-1/-1->45->44 [4] 46/-1/-1->45->44 [5] 46/-1/-1->45->44 [6] 46/52/-1->45->44 [7] -1/-1/-1->45->44
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_46 Trees [0] 47/-1/-1->46->45 [1] 47/-1/-1->46->45 [2] 47/-1/-1->46->45 [3] 47/-1/-1->46->55 [4] 47/-1/-1->46->45 [5] 47/-1/-1->46->45 [6] 47/-1/-1->46->45 [7] 47/38/-1->46->31
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_47 Trees [0] -1/-1/-1->47->46 [1] 40/-1/-1->47->46 [2] 40/-1/-1->47->46 [3] 40/-1/-1->47->46 [4] -1/-1/-1->47->46 [5] 40/-1/-1->47->46 [6] 40/-1/-1->47->46 [7] 40/54/-1->47->46
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_48 Trees [0] 49/56/-1->48->32 [1] 49/-1/-1->48->55 [2] 49/-1/-1->48->55 [3] 49/-1/-1->48->55 [4] 49/-1/-1->48->41 [5] 49/-1/-1->48->55 [6] 49/-1/-1->48->55 [7] 49/-1/-1->48->55
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_49 Trees [0] 50/40/-1->49->48 [1] -1/-1/-1->49->48 [2] 50/-1/-1->49->48 [3] 50/-1/-1->49->48 [4] 50/-1/-1->49->48 [5] -1/-1/-1->49->48 [6] 50/-1/-1->49->48 [7] 50/-1/-1->49->48
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_50 Trees [0] 51/-1/-1->50->49 [1] 51/58/-1->50->34 [2] 51/-1/-1->50->49 [3] 51/-1/-1->50->49 [4] 51/-1/-1->50->49 [5] 51/-1/-1->50->43 [6] 51/-1/-1->50->49 [7] 51/-1/-1->50->49
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_51 Trees [0] 52/-1/-1->51->50 [1] 52/42/-1->51->50 [2] -1/-1/-1->51->50 [3] 52/-1/-1->51->50 [4] 52/-1/-1->51->50 [5] 52/-1/-1->51->50 [6] -1/-1/-1->51->50 [7] 52/-1/-1->51->50
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_52 Trees [0] 53/-1/-1->52->51 [1] 53/-1/-1->52->51 [2] 53/60/-1->52->36 [3] 53/-1/-1->52->51 [4] 53/-1/-1->52->51 [5] 53/-1/-1->52->51 [6] 53/-1/-1->52->45 [7] 53/-1/-1->52->51
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_53 Trees [0] 54/-1/-1->53->52 [1] 54/-1/-1->53->52 [2] 54/44/-1->53->52 [3] -1/-1/-1->53->52 [4] 54/-1/-1->53->52 [5] 54/-1/-1->53->52 [6] 54/-1/-1->53->52 [7] -1/-1/-1->53->52
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_54 Trees [0] 55/-1/-1->54->53 [1] 55/-1/-1->54->53 [2] 55/-1/-1->54->53 [3] 55/62/-1->54->38 [4] 55/-1/-1->54->53 [5] 55/-1/-1->54->53 [6] 55/-1/-1->54->53 [7] 55/-1/-1->54->47
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_55 Trees [0] -1/-1/-1->55->54 [1] 48/-1/-1->55->54 [2] 48/-1/-1->55->54 [3] 48/46/-1->55->54 [4] -1/-1/-1->55->54 [5] 48/-1/-1->55->54 [6] 48/-1/-1->55->54 [7] 48/-1/-1->55->54
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_56 Trees [0] 57/-1/-1->56->48 [1] 57/-1/-1->56->63 [2] 57/-1/-1->56->63 [3] 57/-1/-1->56->63 [4] 57/24/-1->56->-1 [5] 57/-1/-1->56->63 [6] 57/-1/-1->56->63 [7] 57/-1/-1->56->63
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_57 Trees [0] 58/-1/-1->57->56 [1] -1/-1/-1->57->56 [2] 58/-1/-1->57->56 [3] 58/-1/-1->57->56 [4] 58/-1/-1->57->56 [5] -1/-1/-1->57->56 [6] 58/-1/-1->57->56 [7] 58/-1/-1->57->56
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_58 Trees [0] 59/-1/-1->58->57 [1] 59/-1/-1->58->50 [2] 59/-1/-1->58->57 [3] 59/-1/-1->58->57 [4] 59/-1/-1->58->57 [5] 59/26/-1->58->-1 [6] 59/-1/-1->58->57 [7] 59/-1/-1->58->57
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_59 Trees [0] 60/-1/-1->59->58 [1] 60/-1/-1->59->58 [2] -1/-1/-1->59->58 [3] 60/-1/-1->59->58 [4] 60/-1/-1->59->58 [5] 60/-1/-1->59->58 [6] -1/-1/-1->59->58 [7] 60/-1/-1->59->58
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_60 Trees [0] 61/-1/-1->60->59 [1] 61/-1/-1->60->59 [2] 61/-1/-1->60->52 [3] 61/-1/-1->60->59 [4] 61/-1/-1->60->59 [5] 61/-1/-1->60->59 [6] 61/28/-1->60->-1 [7] 61/-1/-1->60->59
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_61 Trees [0] 62/-1/-1->61->60 [1] 62/-1/-1->61->60 [2] 62/-1/-1->61->60 [3] -1/-1/-1->61->60 [4] 62/-1/-1->61->60 [5] 62/-1/-1->61->60 [6] 62/-1/-1->61->60 [7] -1/-1/-1->61->60
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_62 Trees [0] 63/-1/-1->62->61 [1] 63/-1/-1->62->61 [2] 63/-1/-1->62->61 [3] 63/-1/-1->62->54 [4] 63/-1/-1->62->61 [5] 63/-1/-1->62->61 [6] 63/-1/-1->62->61 [7] 63/30/-1->62->-1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO Grank_63 Trees [0] -1/-1/-1->63->62 [1] 56/-1/-1->63->62 [2] 56/-1/-1->63->62 [3] 56/-1/-1->63->62 [4] -1/-1/-1->63->62 [5] 56/-1/-1->63->62 [6] 56/-1/-1->63->62 [7] 56/-1/-1->63->62
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_LL]: 524288
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_LL128]: 4915200
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->buffSizes[NCCL_PROTO_SIMPLE]: 4194304
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->p2pChunkSize: 131072
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO comm->p2pnChannels: 8, comm->p2pnChannelsPerPeer: 1
inspur-gpu-server-15:2563676:2563678 [0] NCCL INFO 8 coll channels, 0 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
run.sh: line 11: 2563676 Segmentation fault               (core dumped) LD_LIBRARY_PATH="$SCRIPT_DIR/../build/lib/" NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=MOCK_ROOT,MOCK NCCL_CROSS_NIC=1 NCCL_TOPO_FILE=$SCRIPT_DIR/topo_8A100_4CX6.xml NCCL_NVLS_ENABLE=0 NCCL_NUM_MOCK_GPU=64 NCCL_NUM_MOCK_NODE=8 $1 ./MultiDevOneP $CORE_FILE

If I run the script with a different topology XML file (e.g. topo_4A100_2CX6.xml), I get a different error:

NCCL version 2.20.5-MockNCCL+cuda12.6

inspur-gpu-server-15:2570998:2571000 [0] misc/cudawrap.cc:34 NCCL WARN Cuda failure 'no CUDA-capable device is detected'

inspur-gpu-server-15:2570998:2571000 [0] init.cc:270 NCCL WARN Cuda failure 3 'initialization error'
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Mocking start: mock_nNodes(8), mock_nRanks(64), mock_nRanksPerNode(8).
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO info_index(0): 0x7f08f0099440, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 10710501475456838185, pidHash: 13668843625570461005, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm_index(0): 0x55eaf7ae4c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 0, ncclCollNet: 0.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO nNodes: 64.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm_index(0): 0x55eaf7ae4c00, node: 0, nNodes: 0, localRank: 0, localRanks: 0, maxLocalRanks: 0, intraRank: 0, intraRanks: 1, ncclCollNet: 0.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoGetSystem, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 0.000000, totalBW: 0.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[0].id: 0000:13:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[1].id: 0000:19:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[2].id: 0000:48:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO (*system)->nodes[GPU].nodes[3].id: 0000:4d:00.0, gdr = 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO After ncclTopoSearchInit, system_info{nodes[NET].count: 3, nodes[GPU].count: 4, maxBW: 24.000000, totalBW: 240.000000}.
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/13000 :GPU/13000 (0/5000.000000/LOC) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/19000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (0/5000.000000/LOC) GPU/48000 (2/240.000000/NVL) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (4/24.000000/PXB) NET/2 (6/24.000000/PXN)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/48000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (0/5000.000000/LOC) GPU/4D000 (2/240.000000/NVL) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO GPU/4D000 :GPU/13000 (2/240.000000/NVL) GPU/19000 (2/240.000000/NVL) GPU/48000 (2/240.000000/NVL) GPU/4D000 (0/5000.000000/LOC) NVS/0 (1/240.000000/NVL) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PXN) NET/2 (4/24.000000/PXB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/0 :GPU/13000 (5/24.000000/PHB) GPU/19000 (5/24.000000/PHB) GPU/48000 (5/24.000000/PHB) GPU/4D000 (5/24.000000/PHB) CPU/0 (2/24.000000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/1 :GPU/13000 (4/24.000000/PXB) GPU/19000 (4/24.000000/PXB) GPU/48000 (6/24.000000/PHB) GPU/4D000 (6/24.000000/PHB) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (6/24.000000/PHB)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO NET/2 :GPU/13000 (6/24.000000/PHB) GPU/19000 (6/24.000000/PHB) GPU/48000 (4/24.000000/PXB) GPU/4D000 (4/24.000000/PXB) CPU/0 (3/24.000000/PHB) NET/0 (5/24.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (0/5000.000000/LOC)
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[0].gpu.rank: 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[1].gpu.rank: 1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[2].gpu.rank: 2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO s->nodes[GPU].nodes[3].gpu.rank: 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm->compCap: 80, comm->minCompCap: 80, comm->maxCompCap: 80
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO mock_comm[0].compCap: 80, mock_comm[0].minCompCap: 80, mock_comm[0].maxCompCap: 80
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/1 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/3 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 0]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/3 GPU/1 GPU/2 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/1 GPU/3 GPU/0 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 0]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/2 GPU/3 GPU/0 GPU/1 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 1]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/8 GPU/11 GPU/9 GPU/10 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/10 GPU/9 GPU/11 GPU/8 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 1]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/8 GPU/9 GPU/10 GPU/11 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/10 GPU/11 GPU/8 GPU/9 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 2]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/16 GPU/19 GPU/17 GPU/18 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/18 GPU/17 GPU/19 GPU/16 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 2]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/16 GPU/17 GPU/18 GPU/19 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/18 GPU/19 GPU/16 GPU/17 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 3]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/24 GPU/27 GPU/25 GPU/26 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/26 GPU/25 GPU/27 GPU/24 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 3]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/24 GPU/25 GPU/26 GPU/27 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/26 GPU/27 GPU/24 GPU/25 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 4]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/32 GPU/35 GPU/33 GPU/34 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/34 GPU/33 GPU/35 GPU/32 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 4]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/32 GPU/33 GPU/34 GPU/35 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/34 GPU/35 GPU/32 GPU/33 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 5]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/40 GPU/43 GPU/41 GPU/42 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/42 GPU/41 GPU/43 GPU/40 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 5]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/40 GPU/41 GPU/42 GPU/43 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/42 GPU/43 GPU/40 GPU/41 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 6]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/48 GPU/51 GPU/49 GPU/50 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/50 GPU/49 GPU/51 GPU/48 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 6]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/48 GPU/49 GPU/50 GPU/51 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/50 GPU/51 GPU/48 GPU/49 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock ring Graph[node: 7]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 2, bw 24.000000/24.000000, type NVL/PXB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/56 GPU/59 GPU/57 GPU/58 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/58 GPU/57 GPU/59 GPU/56 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO ------ Mock tree Graph[node: 7]: ------
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Pattern 1, crossNic 1, nChannels 2, bw 48.000000/24.000000, type NVL/PHB, sameChannels 0
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  0 : NET/1 GPU/56 GPU/57 GPU/58 GPU/59 NET/2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO  1 : NET/2 GPU/58 GPU/59 GPU/56 GPU/57 NET/1
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO -------------------------------

inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO comm->nNodes: 8
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO treeGraph.nChannels: 2, ringGraph.nChannels: 2
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO Channel 00/04 :     0    3    1    2   10    9   11    8   16   19   17   18   26   25   27   24   32   35   33   34   42   41   43   40   48   51   49   50   58   57   59   56    0    3    1    2   10    9   11    8   16   19   17   18   26   25   27   24   32   35   33   34   42   41   43   40   48   51   49   50   58   57   59   56

inspur-gpu-server-15:2570998:2571000 [0] graph/rings.cc:58 NCCL WARN Error : ring 0 does not contain rank 4
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO graph/connect.cc:507 -> 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO init.cc:1492 -> 3
inspur-gpu-server-15:2570998:2571000 [0] NCCL INFO init.cc:1913 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO group.cc:418 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO group.cc:95 -> 3
inspur-gpu-server-15:2570998:2570998 [0] NCCL INFO init.cc:2254 -> 3
Failed, NCCL error MultiDevOneP.cc:134 'internal error - please report this issue to the NCCL developers'
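
To make the `ring 0 does not contain rank 4` warning easier to read, here is a minimal Python sketch that lists which ranks are absent from the `Channel 00/04` ring printed above. It is not the actual check in graph/rings.cc; the function names `parse_channel_ranks` and `missing_ranks` are hypothetical and only illustrate the idea. The rank sequence is copied from the log, where the same 32 ranks appear twice.

```python
import re


def parse_channel_ranks(line: str) -> list[int]:
    """Extract the rank sequence from a 'Channel NN/MM : ...' log line."""
    match = re.search(r"Channel \d+/\d+\s*:\s*(.*)$", line)
    if match is None:
        raise ValueError("not a channel line")
    return [int(token) for token in match.group(1).split()]


def missing_ranks(ring: list[int], n_ranks: int) -> list[int]:
    """Return every rank in [0, n_ranks) that never appears in the ring."""
    present = set(ring)
    return [rank for rank in range(n_ranks) if rank not in present]


# The 32-rank sequence from the log; the log prints it twice in a row.
half = (
    "0 3 1 2 10 9 11 8 16 19 17 18 26 25 27 24 "
    "32 35 33 34 42 41 43 40 48 51 49 50 58 57 59 56"
)
channel_line = f"NCCL INFO Channel 00/04 : {half} {half}"

ring = parse_channel_ranks(channel_line)
missing = missing_ranks(ring, n_ranks=64)  # mock_nRanks(64) from the log

print(len(ring), len(set(ring)))  # 64 entries, but only 32 distinct ranks
print(missing[:4])                # [4, 5, 6, 7]
```

The ring only ever contains local ranks 0 to 3 of each node, which lines up with the topology file reporting `nodes[GPU].count: 4` while the mock setup reports `mock_nRanksPerNode(8)`. Whether that mismatch is the root cause is my assumption, not something the log confirms.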

Additionally, when I run it in a VM, I get yet another error:

NCCL version 2.20.5-MockNCCL+cuda12.9
localhost:799:801 [0] NCCL INFO Mocking start: mock_nNodes(1), mock_nRanks(1), mock_nRanksPerNode(1).
localhost:799:801 [0] NCCL INFO info_index(0): 0x7fa4fc0038d0, rank: 0, cudaDev: 0, nvmlDev: 0, gdrSupport: 1,hostHash: 8982114052393301127, pidHash: 7559370754094648493, busId: ffffffffffffffff:ff:ff.f, cudaCompCap: -1.
run.sh: line 11:   799 Bus error               LD_LIBRARY_PATH="$SCRIPT_DIR/../build/lib/" NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=MOCK_ROOT,MOCK NCCL_CROSS_NIC=1 NCCL_TOPO_FILE=$SCRIPT_DIR/topo_8A100_4CX6.xml NCCL_NVLS_ENABLE=0 NCCL_NUM_MOCK_GPU=1 NCCL_NUM_MOCK_NODE=1 $1 ./MultiDevOneP $CORE_FILE

Expected Behavior

The test (bash run.sh) should run successfully.

Actual Behavior

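The run fails: a segmentation fault with topo_8A100_4CX6.xml, an NCCL internal error ("ring 0 does not contain rank 4") with topo_4A100_2CX6.xml, and a bus error when run in the VM.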

Environment

CUDA: 12.6, installed in /usr/local/cuda
OS: 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
